4.3 Internal Parallelism Performance Bounds

Linear programming codes are large and complex, with many modular components applied sequentially. Improving a sequential code base of this type with parallel resources requires detailed profiling in order to develop an effective strategy. The algorithms that dominate the overall runtime of the solver should be targeted in order to maximize the effects of parallelism. This section presents the methodology used to identify the bounds on internal parallelism by profiling SoPlex, a popular open-source linear programming code that is part of a large operations research code suite [3]. The code is written in C++ and follows an object-oriented design approach. The software is capable of solving large sparse problems.

4.3.1 Profiling Methodology

SoPlex was profiled to expose the detailed components of the solution process. The solution process for each problem was profiled with the perf profiling tool on a Linux system with an Intel i7-4930K Ivy Bridge processor and 32 GB of RAM. The solver was compiled with the default settings for maximum performance given in its documentation and run with the default command line arguments for algorithm parameters. A run was terminated, and counted as a failure, if no solution had been found after one thousand seconds; this limit is long enough to solve any of the test cases provided. The profiling runs recorded the obtained objective function value, the number of iterations, and the runtime.

Sample linear programming problems compiled from the Netlib, Mittelmann, and other linear programming and mixed integer programming databases [9, 10, 11] formed the set of profiled problems. For the mixed integer problems, the linear relaxation was solved rather than the full problem. For each problem the time spent in each function was measured, and the total time spent in that function over the entire test set was calculated. Table B.1 contains the full list of test cases.
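The aggregation step described above can be illustrated with a minimal sketch. It assumes the per-function times have already been extracted from the profiler output into a nested dictionary; the function name `aggregate_function_shares`, the problem names, and the timings are hypothetical and are not taken from the original study.

```python
from collections import defaultdict


def aggregate_function_shares(per_problem_times):
    """Sum per-function time over all problems and return each
    function's share of the total, largest contributor first."""
    totals = defaultdict(float)
    for function_times in per_problem_times.values():
        for function_name, seconds in function_times.items():
            totals[function_name] += seconds

    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}

    shares = {name: seconds / grand_total for name, seconds in totals.items()}
    return dict(sorted(shares.items(), key=lambda item: item[1], reverse=True))


# Made-up timings (seconds) for two hypothetical test problems.
example_times = {
    "problem_a": {"setupPupdate": 2.0, "solveUleftNoNZ": 1.0, "other": 1.0},
    "problem_b": {"setupPupdate": 1.0, "solveUleftNoNZ": 2.0, "other": 3.0},
}
example_shares = aggregate_function_shares(example_times)

for name, share in example_shares.items():
    print(f"{name}: {share:.0%}")
```

Summing raw times before normalizing weights each problem by its runtime, so long-running problems influence the cumulative distribution more than short ones.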

4.3.2 Computing Performance Limits

Figure 4.1 presents the cumulative profiling results from SoPlex on the test cases. The largest contributors to the runtime of the solver are the sparse matrix-vector multiplication function, setupPupdate, and the triangular solve function, solveUleftNoNZ. The names of the functions are drawn directly from the SoPlex source code. Combined, the two algorithms contribute approximately 22% of the runtime of the solver. Other large contributors are several other triangular solve algorithms, the pricing calculation entered4, and dequeuing elements from a heap with deQueueMin. The remaining algorithms, which each individually account for less than one percent of the runtime, contribute a combined twenty-three percent.

[Pie chart. Shares of total runtime, in legend order: SPxSolver::setupPupdate 11%, CLUFactor::solveUleftNoNZ 11%, deQueueMin 11%, CLUFactor::vSolveUright 10%, SPxSteepPR::entered4 6%, CLUFactor::vSolveUrightNoNZ 5%, __strcmp_sse42 5%, deQueueMax 3%, SPxSolver::updateTest 3%, NameSet::number 2%, SPxFastRT::maxDelta 2%, CLUFactor::vSolveLeft2 2%, CLUFactor::forestUpdate 2%, CLUFactor::vSolveLeft 1%, SPxFastRT::selectLeave 1%, CLUFactor::solveLleft 1%, CLUFactor::vSolveLright2 1%, SPxSolver::doPupdate 1%, Other 23%]

Figure 4.1: Distribution of algorithm runtime in SoPlex for the selected test problems

The profiling results for SoPlex reveal upper bounds on the performance of internal parallelism through extrapolation based on Amdahl’s Law, which is central to assessing the potential performance improvement from exploiting parallelism in software. The law states that if a fraction P of a system’s runtime can be executed in parallel, the maximum speed-up obtainable on an infinite number of processors is 1/(1 − P) [2]. A parallel processor can therefore only make a substantial impact on the performance of linear programming software if the runtime is heavily skewed towards a small subset of algorithms that have efficient parallel forms. Though many of the algorithms contained within a sparse solver may effectively target a parallel processor due to high degrees of coarse-grained parallelism, they can only affect the software in a meaningful way if they account for a high percentage of the runtime. This performance analysis found that no such dominant algorithms exist in SoPlex.
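As a minimal sketch of the law stated above, the following computes the speed-up for a finite number of processors and its limit as the processor count grows without bound. The function names `amdahl_speedup` and `amdahl_limit` are illustrative choices, and the parallel fractions in the loop are arbitrary examples rather than measured values.

```python
def amdahl_speedup(parallel_fraction, processors):
    """Speed-up when a fraction P of the runtime is spread evenly
    over N processors and the remaining 1 - P stays sequential."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must lie in [0, 1]")
    if processors < 1:
        raise ValueError("processors must be at least 1")
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / processors)


def amdahl_limit(parallel_fraction):
    """Upper bound 1 / (1 - P), reached as the processor count tends to infinity."""
    if not 0.0 <= parallel_fraction < 1.0:
        raise ValueError("parallel_fraction must lie in [0, 1)")
    return 1.0 / (1.0 - parallel_fraction)


# Arbitrary example fractions: the bound only becomes large when P is close to 1.
for p in (0.10, 0.50, 0.90):
    print(f"P = {p:.2f}: 8 processors -> {amdahl_speedup(p, 8):.2f}x, "
          f"limit -> {amdahl_limit(p):.2f}x")
```

The bound grows slowly until P is large, which is why a runtime profile without a dominant parallelizable algorithm leaves little room for improvement.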

[Chart. Axes: Maximum Possible Global Speed Up (ticks from 0.8 to 1.3) and Function, covering the 18 named functions of Figure 4.1 from SPxSolver::setupPupdate to SPxSolver::doPupdate]

Figure 4.2: Maximum possible speedup predicted by Amdahl’s Law

Figure 4.2 shows the maximum possible global speed-up obtainable from each function, computed using Amdahl’s Law. The maximum possible speed-up to the software is approximately 1.1 times, given a parallel sparse matrix-vector multiplication kernel that completes the same calculation as its sequential counterpart in zero time. The second greatest contributor is the sparse triangular solve. A parallel form of this algorithm that takes a negligible amount of time would also improve the speed of the software by approximately 1.1 times.
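The figures quoted above can be checked with a short calculation. This sketch assumes a runtime share of 11% for each of the two largest contributors, as read from Figure 4.1 and consistent with their combined share of approximately 22%; the helper name `max_global_speedup` is hypothetical.

```python
def max_global_speedup(runtime_share):
    """Amdahl bound when the given share of the runtime drops to zero time."""
    return 1.0 / (1.0 - runtime_share)


# Assumed shares, rounded to whole percent as read from Figure 4.1.
runtime_share = {
    "SPxSolver::setupPupdate": 0.11,    # sparse matrix-vector multiplication
    "CLUFactor::solveUleftNoNZ": 0.11,  # sparse triangular solve
}

# Bound from eliminating each function on its own.
single_bounds = {name: max_global_speedup(share)
                 for name, share in runtime_share.items()}

# Bound from eliminating both functions at once.
combined_bound = max_global_speedup(sum(runtime_share.values()))

for name, bound in single_bounds.items():
    print(f"{name}: at most {bound:.2f}x")
print(f"both combined: at most {combined_bound:.2f}x")
```

Under these assumed shares, each function alone bounds the global speed-up at roughly 1.12 times, and even removing both entirely stays below 1.3 times.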

Practical implementations of these algorithms will not take zero time and will be subject to overheads. These arise from conversions to and from the special data structures, beyond Compressed Sparse Row (CSR), that are required to exploit parallelism, and from the transfer of data between a CPU and a massively parallel processor such as a Graphics Processing Unit (GPU). Amdahl’s Law therefore overestimates the achievable impact of parallelizing these individual algorithms. It is even possible that parallel versions of these inherently sequential algorithms would be outperformed by the sequential versions.
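The effect of such overheads can be illustrated with a simple extension of Amdahl’s Law. This is a sketch under stated assumptions, not a model taken from the original study: the accelerated kernel is assumed to run `kernel_speedup` times faster, and format conversion plus data transfer are assumed to add a fixed `overhead_fraction` of the original runtime. All parameter values below are hypothetical.

```python
def speedup_with_overhead(parallel_fraction, kernel_speedup, overhead_fraction):
    """Global speed-up when a share P of the runtime is accelerated by a
    finite factor and a fixed overhead (as a fraction of the original
    runtime) is added for conversion and transfer."""
    new_runtime = ((1.0 - parallel_fraction)
                   + parallel_fraction / kernel_speedup
                   + overhead_fraction)
    return 1.0 / new_runtime


def break_even_overhead(parallel_fraction, kernel_speedup):
    """Overhead fraction at which the parallel version only matches the
    sequential one (global speed-up of exactly 1)."""
    return parallel_fraction * (1.0 - 1.0 / kernel_speedup)


# Hypothetical scenario: an 11% runtime share accelerated tenfold.
share, factor = 0.11, 10.0
for overhead in (0.0, 0.02, 0.15):
    result = speedup_with_overhead(share, factor, overhead)
    print(f"overhead {overhead:.0%} of runtime -> global speed-up {result:.3f}x")
print(f"break-even overhead: {break_even_overhead(share, factor):.1%} of runtime")
```

In this hypothetical scenario any overhead above about 9.9% of the original runtime makes the parallel version slower than the sequential one, which matches the qualitative point made above.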


The performance improvements available from internal parallelism in a linear programming solver based on the Simplex Algorithm are therefore minor due to the nature of the software. Internal parallelism cannot meaningfully affect the solver because its runtime is spread across a large number of sequential algorithms.

In document Modern Optimization Algorithms and Applications: Architectural Layout Generation and Parallel Linear Programming (Page 79-83)