Tuning Algorithms, Tuning Code
Guideline 4.6 Filtering: Avoid inserting an element into a data structure if the
element cannot affect the outcome of the computation.
Besides memoization we have considered three tuning strategies for ES: loop abort for the outer main loop, loop abort for the inner main loop, and filtering the Ps data structure. These strategies interact with one another – for example, the effectiveness of the inner loop abort test depends on whether or not the outer loop abort test is implemented.
As a general rule, the proper way to evaluate tuneups that interact in this way is to use a full factorial design as described in Section 2.2.2. This experimental design permits analysis of the main effects of each tuneup alone, as well as the interaction effects of various combinations. The difficulty is that the design requires 16 = 2^4 design points, and as many code versions, to test all combinations of the four tuneups. Implementing 16 versions of one program is prohibitively time-consuming in many cases. The design would have to be even bigger to incorporate tests of alternative strategies for bounding the diameter, a variety of input classes and sizes, and other runtime environments.
Fortunately we can apply algorithmic reasoning to eliminate most of these combinations from consideration:
• It is a safe bet that memoization improves computation time in every case, because the cost of storing a number in a table is tiny compared to the cost of a redundant search in S. The O(n^2) initialization of the distance matrix represents a very small proportion of other initialization costs. We can cut the design in half by not testing versions without memoization.
• The effectiveness of the outer loop abort test depends on the number of invocations of S.distance saved versus the cost of finding the bound on the largest essential edge. The inner loop abort and filtering strategies modify the cost of S.distance, which affects the balance between invocation cost and bounding strategy. This experimental study does not try to optimize that balance, so we omit design points that omit the outer loop test, on the principle that few invocations of S.distance are better than many invocations, no matter how fast it is.
• The only design points remaining involve the inner loop abort test and filtering, totaling four design points. Exploratory experiments on these four versions of the code suggest that the inner loop abort never improves total computation time and sometimes slows it down: it is better not to abort this loop so that more memoization can occur. Experiments also reveal that filtering always provides a small reduction in total computation time.
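The pruning argument above can be sketched as a small enumeration. This is only an illustration: the tuneup labels are hypothetical names chosen here, not identifiers from the ES code.

```python
from itertools import product

# Hypothetical labels for the four tuneups discussed in the text.
TUNEUPS = ("memoization", "outer_abort", "inner_abort", "filtering")

# Full factorial design: every on/off combination of the four tuneups.
full_design = [dict(zip(TUNEUPS, levels))
               for levels in product((False, True), repeat=len(TUNEUPS))]

# Step 1: drop every version without memoization.
with_memo = [p for p in full_design if p["memoization"]]

# Step 2: drop every version without the outer loop abort test.
remaining = [p for p in with_memo if p["outer_abort"]]

print(len(full_design), len(with_memo), len(remaining))  # 16 8 4
for point in remaining:
    print(point["inner_abort"], point["filtering"])
```

Only the inner loop abort and filtering still vary across the four remaining points, which is why four code versions suffice for the exploratory experiments.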
On the basis of these results, our next implementation V2 incorporates memoization (from V1), filtering, and the BFS search to support the outer loop abort test. Average CPU times (in seconds) for 10 trials at each problem size are shown in the table.

Runtime    n = 800    n = 1000    n = 1200    n = 1400
V0           24.11       49.04       87.57      144.97
V1            0.49        0.86        1.33        1.93
V2            0.24        0.40        0.62        0.91
V0/V2       100.46      122.60      141.24      159.31
V1/V2         2.04        2.15        2.15        2.12

V2 runs twice as fast as V1, mostly because of the loop abort test. Although that test cuts about 90 percent of calls to S.distance, the extra cost of the BFS searches means that the speedup is only a factor of 2.
Altogether, these three tuneups have contributed speedups by factors between 100 and 160 over the original implementation of ES.
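As a quick check of the ratio rows in the table, the speedups can be recomputed from the mean CPU times. The times are copied from the table; the helper name speedup is introduced here purely for illustration.

```python
# Mean CPU seconds from the table (10 trials per problem size).
sizes = [800, 1000, 1200, 1400]
cpu = {
    "V0": [24.11, 49.04, 87.57, 144.97],
    "V1": [0.49, 0.86, 1.33, 1.93],
    "V2": [0.24, 0.40, 0.62, 0.91],
}

def speedup(slow, fast):
    """Ratio of mean times, slow version over fast version, per problem size."""
    return [round(s / f, 2) for s, f in zip(cpu[slow], cpu[fast])]

v0_over_v2 = speedup("V0", "V2")
v1_over_v2 = speedup("V1", "V2")

for n, a, b in zip(sizes, v0_over_v2, v1_over_v2):
    print(f"n = {n}: V0/V2 = {a}, V1/V2 = {b}")
```

The V0/V2 ratio grows with n while V1/V2 stays near 2, consistent with the statements that V2 is about twice as fast as V1 and 100 to 160 times faster than V0.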
Customizing The Data Structure. Profiling reveals our next target of opportunity:
In V2, the Ps.extractMin function takes about 36 percent of computation time, more than any other function. Rather than zooming in on that particular operation, however, we take a wider look at Ps and find ways to customize the
data structure by matching operation costs to the operation frequencies imposed
by the ES algorithm.
Priority queue Ps supports four operations: initialize, extractMin, insert, and decreaseKey. Versions V0 through V2 employ a textbook implementation of Ps using a simple binary heap, together with a location array that stores, for each vertex, its location in the heap. The location array is used in the decreaseKey(z, z.dist) operation, to find vertex z in the heap. Here are some alternative ways to implement Ps.
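For concreteness, here is a minimal sketch of a binary heap with a location array. It is not the code used in V0 through V2; the class and method names are assumptions made for illustration, and vertices are assumed to be integers 0..n-1.

```python
class IndexedMinHeap:
    """Binary min-heap of [key, vertex] entries plus a location array."""

    def __init__(self, n):
        self.heap = []             # heap[i] = [key, vertex]
        self.location = [-1] * n   # location[v] = index of v in heap, or -1

    def _swap(self, i, j):
        # Every move of a heap element also costs a location update.
        self.heap[i], self.heap[j] = self.heap[j], self.heap[i]
        self.location[self.heap[i][1]] = i
        self.location[self.heap[j][1]] = j

    def _siftup(self, i):
        while i > 0:
            parent = (i - 1) // 2
            if self.heap[parent][0] <= self.heap[i][0]:
                break
            self._swap(i, parent)
            i = parent

    def _siftdown(self, i):
        size = len(self.heap)
        while 2 * i + 1 < size:
            child = 2 * i + 1
            if child + 1 < size and self.heap[child + 1][0] < self.heap[child][0]:
                child += 1
            if self.heap[i][0] <= self.heap[child][0]:
                break
            self._swap(i, child)
            i = child

    def insert(self, vertex, key):
        self.heap.append([key, vertex])
        self.location[vertex] = len(self.heap) - 1
        self._siftup(len(self.heap) - 1)

    def extract_min(self):
        key, vertex = self.heap[0]
        last = self.heap.pop()
        self.location[vertex] = -1
        if self.heap:                  # move the last entry to the root
            self.heap[0] = last
            self.location[last[1]] = 0
            self._siftdown(0)
        return vertex, key

    def decrease_key(self, vertex, new_key):
        i = self.location[vertex]      # constant-time lookup of the vertex
        self.heap[i][0] = new_key
        self._siftup(i)
```

The _swap helper makes the trade-off visible: the location array gives decrease_key a constant-time lookup, but every sift step pays two extra array writes.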
• Option 1: initialization vs. insertion. The original version initializes the heap to contain one vertex s and later performs some number I of insert operations. An alternative is to initialize Ps to contain all n vertices and perform zero inserts later.
• Option 2: initialization vs. insertion. Yet another alternative is to use BFS to find all B ≤ n vertices reachable from s. Initialize the heap to contain those B
vertices and perform zero inserts.
• Option 3: memoization. This option memoizes the entire heap, saving it at the end of each invocation of S.distance(s,d) and restoring it the next time the function is called with the same source s. This requires replacing Ps with an array P[s] to hold separate heaps for each s. Since edges may have been added to S since the last time P[s] was saved, the heap may not correctly reflect distances in S when restored. Therefore, the restore step must apply the relax operation to every edge that was added in the interim. This restore step can be implemented without increasing the total asymptotic cost of the algorithm (see [24]). With this modification, the cost of initializing and inserting during each call to S.distance drops to zero, but the restore (relax) step must be performed for some number R of new edges in each call.
• Option 4: decrease-key vs. sifting. The location array finds z in the heap in constant time, which is clearly better than a linear-cost search for z. On the other hand, the location array must be updated every time a heap element is moved by a siftup or siftdown operation. Sifting occurs in the inner loops of the insert, extractMin, and decreaseKey operations, and updating the location array as well as the heap likely doubles the cost of sifting. If the number K of decrease-key operations is small compared to T, the number of elements sifted, it would be faster to omit the location array. Instead, decreaseKey could be implemented by placing a new copy of z in the heap. The extractMin operation would be modified to skip duplicate keys.
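Option 4 can be sketched with Python's standard heapq module, which has no location array and no decrease-key operation. This is a generic Dijkstra-style search under stated assumptions, not the S.distance code: the function name and the adjacency format (a dict mapping each vertex to a list of (neighbor, cost) pairs) are hypothetical.

```python
import heapq

def lazy_dijkstra(adj, s):
    """Distances from s, using duplicate heap entries instead of decreaseKey."""
    dist = {s: 0.0}
    done = set()                 # vertices whose distance is final
    heap = [(0.0, s)]
    while heap:
        d, z = heapq.heappop(heap)
        if z in done:            # stale duplicate of an already extracted vertex
            continue
        done.add(z)
        for w, cost in adj.get(z, ()):
            nd = d + cost
            if w not in done and nd < dist.get(w, float("inf")):
                dist[w] = nd
                # "decreaseKey": push a new copy; the old copy stays in the heap.
                heapq.heappush(heap, (nd, w))
    return dist

adj = {0: [(1, 4.0), (2, 1.0)], 2: [(1, 2.0)], 1: [(3, 5.0)]}
print(lazy_dijkstra(adj, 0))
```

In this small graph vertex 1 is pushed twice, once with key 4.0 and once with key 3.0; the stale 4.0 entry is skipped when it is eventually extracted. The heap can grow by one entry per decrease-key, so the approach pays off only when K is small relative to T.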
The right choice depends on the values of parameters I, B, R, K, and T and on the code costs of these various alternatives. Most of these options interact, suggesting another factorial design. But before writing 16 = 2^4 copies of the code, we use exploratory experiments to measure these parameters and identify the most promising options.
The first experiment modifies V2 to report the outcome (insert, reject) for each edge in the graph (with the outer loop abort turned off for the moment). From this experiment we learn that when ES is applied to a random uniform graph, it works in two phases. In phase 1, S grows rapidly because most edges are accepted as essential; in phase 2, when S is completed, all edges are rejected as nonessential.
These two phases produce distinct patterns of access to Ps and different conclusions about which implementation options are best. To take a concrete example, in one trial with n = 100 and m = 4950, phase 1 (building S) occupied the first 492
iterations. When S was completed, it contained 281 edges; that means that about 57 percent of edges were accepted during phase 1. Phase 2 (when all edges were nonessential) occupied the remaining 4458 iterations.
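An experiment of this kind can be sketched as follows. The sketch assumes the usual formulation of ES (process edges in increasing cost order and insert an edge into S only when the current distance in S between its endpoints exceeds its cost) on a complete undirected graph with uniform random costs. The function names are hypothetical, and none of the tuneups discussed here are included.

```python
import heapq
import random

def distance(adj, s, d):
    """Dijkstra distance from s to d in subgraph adj (inf if unreachable)."""
    dist = {s: 0.0}
    heap = [(0.0, s)]
    while heap:
        du, u = heapq.heappop(heap)
        if u == d:
            return du
        if du > dist[u]:          # stale heap entry
            continue
        for v, c in adj[u]:
            if du + c < dist.get(v, float("inf")):
                dist[v] = du + c
                heapq.heappush(heap, (du + c, v))
    return float("inf")

def es_outcomes(n, seed=0):
    """Run the assumed ES loop and record 'insert' or 'reject' per edge."""
    rng = random.Random(seed)
    edges = sorted((rng.random(), x, y)
                   for x in range(n) for y in range(x + 1, n))
    adj = [[] for _ in range(n)]   # adjacency lists of subgraph S
    outcomes = []
    for cost, x, y in edges:
        if distance(adj, x, y) > cost:
            adj[x].append((y, cost))
            adj[y].append((x, cost))
            outcomes.append("insert")
        else:
            outcomes.append("reject")
    return outcomes

outcomes = es_outcomes(40)
# Phase 1 ends at the last accepted edge; everything after it is rejected.
phase1 = 1 + max(i for i, o in enumerate(outcomes) if o == "insert")
print(len(outcomes), outcomes.count("insert"), phase1)
```

The printed triple gives m, the number of edges in S, and the length of phase 1 for one random instance; the exact values depend on the seed and on n.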
Here are some observations about the parameters in these two phases; the notations I1 and I2 refer to parameter I in phases 1 and 2, respectively.
1. Options 1 and 2. In phase 1 the original implementation performs one initialization and I1 = 30.9 inserts per call on average. Option 1 would initialize a heap of size n = 100 and perform zero inserts. Option 2 would perform a BFS search on B1 = 51.2 vertices (on average), then initialize a heap of size B1, and then perform zero inserts. It is a close call: none of these options is likely to be significantly faster than another. In phase 2, however, option 1 is the clear winner: the original version performs I2 = 97.1 inserts on average, which must be slower than initializing a heap of size 100, and the BFS search in option 2 is superfluous since B = n.
2. Option 3. In phase 1 the original implementation performed I1 = 30.9 inserts and K1 = 4.8 decrease-keys per call on average, totaling 15,202.8 = 492 × 30.9 inserts and 2,361.6 = 492 × 4.8 decrease-keys. By comparison, option 3 would initialize a heap of size n = 100 and perform 0 inserts and no more than K1 decrease-keys per call: this option is clearly worth exploring. In phase 2, option 3 is the clear winner because the cost of the restore operation drops to zero, while the original version performs I2 = 97.1 inserts per invocation.
3. Option 4. In phase 1 the average number of decrease-keys per call to S.distance is K1 = 4.8, compared to T1 = 159.4 sift steps: savings would likely accrue from implementing option 4. In phase 2, K2 = 40.4 while T2 = 625.6, so the cost difference is smaller.
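The comparison in item 3 comes down to the ratio K/T in each phase, which can be worked out from the reported averages. The variable names below are chosen for this illustration only.

```python
# Per-call averages reported for the n = 100 trial.
phase1_calls = 492
K1, T1 = 4.8, 159.4     # decrease-keys and sift steps per call, phase 1
K2, T2 = 40.4, 625.6    # decrease-keys and sift steps per call, phase 2

ratio_1 = K1 / T1       # decrease-keys per sift step, phase 1
ratio_2 = K2 / T2       # decrease-keys per sift step, phase 2
total_decrease_keys_1 = phase1_calls * K1

print(f"phase 1: K/T = {ratio_1:.3f}")   # about 0.030
print(f"phase 2: K/T = {ratio_2:.3f}")   # about 0.065
print(f"phase 1 decrease-keys in total: {total_decrease_keys_1:.1f}")
```

Decrease-keys are roughly 3 percent of sift steps in phase 1 and about 6.5 percent in phase 2. Dropping the location array looks attractive in both phases, but the margin is narrower in phase 2.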
With the outer loop abort test, V2 spends about 90 percent of its time in phase 1. The path is clear: option 3 should be explored next, and option 4 is also promising. These conclusions should be valid for random uniform graphs at larger values of
n, but initialization might change the balance of some parameters at small n. If the
ES algorithm is to be applied to other input classes, these tests should be rerun to check whether these basic relations still hold.