Chapter 3 Performance modelling and scheduling of a data-dependent code
3.4 Parallelising nreg
Since the runtime ofnregcan be extensive (tens of hours), it may be desirable to have a parallel version ofnreg. Despite the inefficiencies introduced by parallelisa- tion overheads such as duplicated computations, or delays waiting on inter-node communications, a parallel implementation can improve turnaround and resource utilisation in cases where a user has many free machines and wishes to run small registration workflows. Depending on the requirements of a workflow, a mix of sequential and parallel tasks may provide the best overall use of the system. For example, when a workflow consists of a group of registration tasks, each could be al- locatedCPUs in proportion to their expected runtime to balance the workload more
evenly and to avoid scenarios where one slow process delays the overall makespan unnecessarily. Similarly, a high priority individual task could be allocated a number ofCPUs to ensure it meets a deadline.
The processing performed bynregas described in the previous section is almost entirelyCPUand memory bound, and this lack of I/O removes one potential barrier to a scalable parallelisation. On the other hand, the internal workings ofnreg’s al- gorithm appear less amenable to parallelisation: they involve an unknown number of iterations of mesh transformation, each depending on the last. Furthermore this occurs at several different step-sizes and resolutions.
However, the amount of work done inside each iteration is substantial. It involves the experimental movement of thousands of control points and, as observed before, the use of B-splines means that the movement of control points only affects voxels in the vicinity of those control points. The effect of each of these movements can be calculated independently and thus in parallel. The simplest parallel decomposition would involve dividing the work into N equal pieces and handing one to each CPU. This decomposition is simple and needs no communications to arrange, but is inefficient on heterogeneous systems or systems with varying load: the slowestCPUlimits the overall runtime. This analysis also shows that the variable cost of gathering the image statistics means different amounts of work occur per candidate move, making it impossible to statically divide up the workload into equal chunks. This leads to a first parallelisation technique: a simple master/slave decomposition of the workload. Approximately 95% of the application’s runtime is spent in a very small part of the code —EvaluateDerivativecalled repeatedly
fromEvaluateGradient. The remainder of the code runs identically and in lockstep
on eachCPU. When the partial derivative loop is reached inEvaluateGradient, oneCPU(the master), instructs each of the slaveCPUs to perform a small batch of
EvaluateDerivativecalls independently. When this work completes, the results
are passed back to the master and more work is received. When all the partial derivatives for an iteration have been completed, the master distributes all the results to all theCPUs. This involves a broadcast of the gradient vector, a small data structure. Then every CPU returns to ‘lockstep mode’, executing the same code as every other until they enter the next iteration.
The code changes needed to implement this were small and non-invasive: about 200 lines of new code were introduced. Table 3.3shows the runtime of the sequen- tial code on a single CPU along with the speedup of two different parallelisation strategies relative to the sequential code timed using the same datasets. The test
Target Source Sampling NMI eval/reset b7 s2 b7 s2 721 kc/iter 244 kc/iter b7 s2 b7 s1 816 kc/iter 254 kc/iter b9 s2 b9 s1 945 kc/iter 282 kc/iter b9 s4 e2 b9 s3 e2 99 kc/iter 246 kc/iter b9 s3 e2 b8 s3 e2 93 kc/iter 286 kc/iter Table 3.2: EvaluateDerivativecosts CPUs Runtime Speedup 1 Speedup 2
1 58,320 s 1.00× 1.00× 2 31,895 s 1.80× 1.9× 4 19,340 s 3.09× 3.7× 8 14,045 s 4.64× 6.2× 16 9,625 s 6.06× 10.8× 24 — — 13.9×
Table 3.3: Parallel speedup ofnregvs sequential code
environment was a cluster of 16 dual core Opteron 246 running 64 bit SUSE Linux Enterprise Server 9 with version 2.6.5 of the kernel, and connected using a giga- bit ethernet switch. The first parallelisation strategy simply uses the master/slave division of EvaluateDerivativedescribed above. As can be seen from the first speedup column in the table, it is fairly effective for such a simple parallelisation. It scales adequately on smaller experimental clusters, but the scaling is insufficient for a larger production system, especially for a code that is in principle quite amenable to parallelisation.
To improve the speedup further, some of the remaining sequential code is paral- lelised. TheEvaluatesteps in the gradient descent optimisations calculates all the voxels in the transformed image, and measures the similarity with the source. The voxel calculation can be divided up into stripes which are executed in parallel on all theCPUs, and the results returned to the masterCPUand broadcast back to all the slaves. After the broadcast, allCPUs can continue in lockstep to calculate the similarity. Unlike the earlier parallelisation, this is relatively bandwidth intensive and will perform poorly with a slow interconnect, but it can provide some extra scalability on larger systems.