2.2 Adapting algorithms to distributed computing frameworks
3.1.1 Conjugate Gradient linear system solver (CG)
Solving systems of linear algebraic equations (SLAE) is a problem often encountered in the fields like engineering, physics, chemistry, computer science or economics. The main goal is different in each of these fields, but the main challenge stays the same. How to efficiently solve systems of linear equations with a huge number of unknowns? For extremely large matrices, it is often unfeasible to find an exact solution for the system of linear equations, either because of time or resource constraints, and an approximation of the solution vector x is found instead.
Conjugate Gradient [63] is an iterative algorithm for solving algebraic systems of linear equations. It solves linear systems using matrix and vec- tor operations. Linear system is first transferred into the matrix form:
Ax = b (3.1)
where A is a matrix consisting of the coefficients a11, a12, ..., amn of the
system, b is a known vector consisting of constant terms of the system b1, b2, ..., bm and x is the solution vector, made up of the unknowns of the
system x1, x2, ..., xn.
CG then performs an initial inaccurate guess of the solution x and then iteratively improves its accuracy by applying gradient descent by using the matrix A and vector b values to find the approximate vector x values, if a solution exists at all. The accuracy of the Conjugate Gradient result depends on the number of iterations that are executed.
The input matrices are generally large, but can typically fit into collec- tive memory of computer clusters. The computational complexity of the algorithm is not high and the performed task at every iteration is relatively small which means the ratio between communication and computation is unusually high, especially in comparison to CLARA and PAM. This makes CG a good candidate as an additional benchmarking algorithm for iterative MapReduce frameworks.
Unknowns 24 500 1000 2000 4000 6000 8000 1 node 259 261 327 687 1938 3810 7619 2 nodes 255 259 298 507 1268 2495 4185 4 nodes 255 236 281 360 721 1374 2193 8 nodes 251 251 291 397 563 824 1246 16 nodes 236 240 278 297 338 511 809
Table 3.1: Run times for the CG implementation in MapReduce under varying cluster size. [2]
Adapting CG to MapReduce is relatively complex task as it is not pos- sible to directly adapt the whole algorithm to the MapReduce model. The matrix and vector operations used by CG at each iterations can be reduced to the MapReduce model instead. Every time one of these operations is used in the CG, a new MapReduce job is executed. As a result, multiple MapReduce jobs are executed at every iteration. This is not efficient as it takes time for the Hadoop framework to schedule, start up and finish Map- Reduce jobs. It can be viewed as MapReduce job latency and executing multiple jobs at each iteration adds up to a significant overhead.
Additionally, in Hadoop, the matrix A is stored on the HDFS and is used as an input for the matrix-vector multiplication operation at every iteration. In Hadoop MapReduce framework it is not possible to cache the input between different executions of MapReduce jobs, so every time this operation is executed, the input must be read again from the file system. As the matrix A values never change between iterations, the exact same work is repeated at every iteration. This adds up to a significant additional overhead.
Experiments were run with different number of parallel nodes to be able to calculate relative parallel speedup. Relative parallel speedup mea- sures how many times the parallel execution is faster than running the same MapReduce algorithm on single node. If it is larger than 1, it means there is at least some gain from doing the work in parallel. Speedup which is equal to the number of nodes is considered ideal and means that the algorithm has a perfect scalability.
We also generated different sized input matrices to investigate the effect of the problem size on the performance. While CG is typically used for larger sparse matrices mostly consisting of zeros, we used dense matrices
Figure 3.1: Speedup for Conjugate Gradient algorithm with different number of nodes. [2]
to simplify the benchmarking process. Run times for the CG algorithm are shown in Table 3.1 and calculated speedup is shown on Figure 3.1.
It took 220 seconds to solve a system with only 24 unknowns in a 16 node cluster, which is definitely very slow for solving a linear system with such a small number of calculations needed. It indicates that most of the time is spent on the background tasks and not on the actual calculations.
CG MapReduce algorithm is able to achieve much better parallel speed- up when solving larger linear systems. It took almost 2 hours to solve a linear system with 8000 unknowns on one node and 809 seconds on 16 nodes. While these results show that MapReduce is able to take advan- tage of additional computing resources, it does not mean that the time is efficiently spent on actual computations. To investigate this further, we de- cided to also adapt CG to Twister, which is a MapReduce framework that is designed for iterative applications and has previously been described in in section 2.1.1. Results for these experiments are provided in section 3.2.