Tuning Algorithms, Tuning Code

Guideline 4.6 Filtering: Avoid inserting an element into a data structure if the element cannot affect the outcome of the computation.
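One way to picture the guideline is a Dijkstra-style search that only has to decide whether the distance from s to d is at most some bound. A vertex whose tentative distance already exceeds the bound cannot change that decision, so it is never pushed onto the priority queue. The sketch below is an illustration under that assumption; the function name `distance_within`, the adjacency-dict graph format, and the use of Python's `heapq` are choices made here, not the book's implementation of S.distance.

```python
import heapq

def distance_within(graph, s, d, bound):
    """Return (distance, insertions) for a bounded shortest-path search.

    graph: dict mapping vertex -> list of (neighbor, weight) pairs
           (hypothetical format; weights are assumed nonnegative).
    distance is None when no s-d path of length <= bound exists.
    insertions counts queue pushes, to show what filtering saves.
    """
    dist = {s: 0.0}
    heap = [(0.0, s)]
    insertions = 1
    while heap:
        du, u = heapq.heappop(heap)
        if du > dist.get(u, float("inf")):
            continue  # stale entry left behind by a later improvement
        if u == d:
            return du, insertions
        for v, w in graph.get(u, []):
            dv = du + w
            # Filtering: a vertex farther than `bound` cannot lie on an
            # s-d path of length <= bound, so it is never inserted.
            if dv > bound:
                continue
            if dv < dist.get(v, float("inf")):
                dist[v] = dv
                heapq.heappush(heap, (dv, v))
                insertions += 1
    return None, insertions
```

Calling the function with `bound=float("inf")` turns filtering off, which makes it easy to compare the number of insertions with and without the filter on the same graph.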

Besides memoization, we have considered three tuning strategies for ES: loop abort for the outer main loop, loop abort for the inner main loop, and filtering the Ps data structure. These strategies interact with one another – for example, the effectiveness of the inner loop abort test depends on whether or not the outer loop abort test is implemented.

As a general rule, the proper way to evaluate tuneups that interact in this way is to use a full factorial design as described in Section 2.2.2. This experimental design permits analysis of the main effects of each tuneup alone, as well as the interaction effects of various combinations. The difficulty is that the design requires 16 = 2^4 design points, and as many code versions, to test all combinations of the four tuneups. Implementing 16 versions of one program is prohibitively time-consuming in many cases. The design would have to be even bigger to incorporate tests of alternative strategies for bounding the diameter, a variety of input classes and sizes, and other runtime environments.
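The size of a full factorial design is easy to see by enumerating it. The following sketch lists every on/off combination of the four tuneups; the factor labels are names chosen here for illustration, and each factor is assumed to have exactly two levels (present or absent).

```python
from itertools import product

# Labels chosen here for the four tuneups; each is either on or off.
FACTORS = ["memoization", "outer_abort", "inner_abort", "filtering"]

def full_factorial(factors):
    """Return every on/off combination of the given factors as a dict."""
    return [dict(zip(factors, levels))
            for levels in product([False, True], repeat=len(factors))]

design = full_factorial(FACTORS)
print(len(design), "design points")
for point in design[:3]:
    print(point)
```

Each additional two-level factor doubles the list, which is why adding input classes, problem sizes, or environments makes the full design grow so quickly.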

Fortunately we can apply algorithmic reasoning to eliminate most of these combinations from consideration:

• It is a safe bet that memoization improves computation time in every case, because the cost of storing a number in a table is tiny compared to the cost of a redundant search in S. The O(n^2) initialization of the distance matrix represents a very small proportion of other initialization costs. We can cut the design in half by not testing versions without memoization.

• The effectiveness of the outer loop abort test depends on the number of invocations to S.distance saved versus the cost of finding the bound on the largest essential edge. The inner loop abort and filtering strategies modify the cost of S.distance, which affects the balance between invocation cost and bounding strategy. This experimental study does not try to optimize that balance, so we omit design points that omit the outer loop test, on the principle that few invocations of S.distance are better than many invocations, no matter how fast it is.

• The only design points remaining involve the inner loop abort test and filtering, totaling four design points. Exploratory experiments on these four versions of the code suggest that the inner loop abort never improves total computation time and sometimes slows it down: it is better not to abort this loop so that more memoization can occur. Experiments also reveal that filtering always provides a small reduction in total computation time.
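The pruning argument can be sketched as fixing the levels of the factors settled by reasoning and enumerating only the rest. The factor labels and the helper `remaining_design` are hypothetical names introduced for this illustration.

```python
from itertools import product

FACTORS = ["memoization", "outer_abort", "inner_abort", "filtering"]

# Levels decided by algorithmic reasoning rather than by experiment:
# memoization and the outer loop abort are always switched on.
FIXED = {"memoization": True, "outer_abort": True}

def remaining_design(factors, fixed):
    """Enumerate design points over the factors that are still free."""
    free = [f for f in factors if f not in fixed]
    points = []
    for levels in product([False, True], repeat=len(free)):
        point = dict(fixed)
        point.update(zip(free, levels))
        points.append(point)
    return points

points = remaining_design(FACTORS, FIXED)
print(len(points), "design points remain")
```

Fixing two of the four factors leaves the inner loop abort and filtering as the only free choices, so only those combinations need separate code versions.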

On the basis of these results, our next implementation V2 incorporates memoization (from V1), filtering, and the BFS search to support the outer loop abort test. Average CPU times (in seconds) for 10 trials at each problem size are shown in the table.

Runtime    n = 800   n = 1000   n = 1200   n = 1400
v0           24.11      49.04      87.57     144.97
v1            0.49       0.86       1.33       1.93
v2            0.24       0.40       0.62       0.91
v0/v2       100.46     122.60     141.24     159.31
v1/v2         2.04       2.15       2.15       2.12

V2 runs twice as fast as V1, mostly because of the loop abort test. Although that test cuts about 90 percent of calls to S.distance, the extra cost of the BFS searches means that the speedup is only a factor of 2.

Altogether, these three tuneups have contributed speedups by factors between 100 and 160 over the original implementation of ES.
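The speedup rows of the table can be recomputed directly from the reported mean times. The snippet below is only a check on that arithmetic; the variable names are chosen here, and the times are copied from the table above.

```python
# Mean CPU times in seconds, copied from the table above.
sizes = [800, 1000, 1200, 1400]
times = {
    "v0": [24.11, 49.04, 87.57, 144.97],
    "v1": [0.49, 0.86, 1.33, 1.93],
    "v2": [0.24, 0.40, 0.62, 0.91],
}

def speedup(slow, fast):
    """Ratio of mean times, slow version over fast version, per problem size."""
    return [round(a / b, 2) for a, b in zip(times[slow], times[fast])]

for n, r0, r1 in zip(sizes, speedup("v0", "v2"), speedup("v1", "v2")):
    print(f"n = {n}: v0/v2 = {r0}, v1/v2 = {r1}")
```

The v0/v2 ratio grows with n while v1/v2 stays near 2, which matches the observation that the loop abort test contributes a roughly constant factor.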

Customizing The Data Structure. Profiling reveals our next target of opportunity: In V2, the Ps.extractMin function takes about 36 percent of computation time, more than any other function. Rather than zooming in on that particular operation, however, we take a wider look at Ps and find ways to customize the data structure by matching operation costs to the operation frequencies imposed by the ES algorithm.

Priority queue Ps supports four operations: initialize, extractMin, insert, and decreaseKey. Versions V0 through V2 employ a textbook implementation of Ps using a simple binary heap, together with a location array that stores, for each vertex, its location in the heap. The location array is used in the decreaseKey(z, z.dist) operation, to find vertex z in the heap. Here are some alternative ways to implement Ps.
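For readers who want to see the textbook structure concretely, here is a minimal sketch of a binary heap with a location array. It assumes vertices are integers 0..n-1 and that decrease_key is only called on a vertex currently in the heap with a key no larger than its old one; the class and method names are chosen here and are not taken from the book's code.

```python
class IndexedMinHeap:
    """Binary min-heap over vertices 0..n-1 with a location array."""

    def __init__(self, n):
        self.heap = []             # heap[i] = vertex stored at position i
        self.key = [None] * n      # key[v] = current priority of vertex v
        self.location = [-1] * n   # location[v] = position of v, -1 if absent

    def _place(self, pos, v):
        # Every move of a heap element also updates the location array.
        self.heap[pos] = v
        self.location[v] = pos

    def _sift_up(self, pos):
        v = self.heap[pos]
        while pos > 0:
            parent = (pos - 1) // 2
            if self.key[self.heap[parent]] <= self.key[v]:
                break
            self._place(pos, self.heap[parent])
            pos = parent
        self._place(pos, v)

    def _sift_down(self, pos):
        v = self.heap[pos]
        size = len(self.heap)
        while 2 * pos + 1 < size:
            child = 2 * pos + 1
            right = child + 1
            if right < size and self.key[self.heap[right]] < self.key[self.heap[child]]:
                child = right
            if self.key[v] <= self.key[self.heap[child]]:
                break
            self._place(pos, self.heap[child])
            pos = child
        self._place(pos, v)

    def insert(self, v, k):
        self.key[v] = k
        self.heap.append(v)
        self.location[v] = len(self.heap) - 1
        self._sift_up(len(self.heap) - 1)

    def decrease_key(self, v, k):
        # The location array finds v in constant time; no search is needed.
        self.key[v] = k
        self._sift_up(self.location[v])

    def extract_min(self):
        top = self.heap[0]
        last = self.heap.pop()
        self.location[top] = -1
        if self.heap:
            self._place(0, last)
            self._sift_down(0)
        return top, self.key[top]
```

Note that `_place` writes to both arrays on every move, which is the bookkeeping overhead that the alternatives below try to avoid.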

• Option 1: initialization vs. insertion. The original version initializes the heap to contain one vertex s and later performs some number I of insert operations. An alternative is to initialize Ps to contain all n vertices and perform zero inserts later.

• Option 2: initialization vs. insertion. Yet another alternative is to use BFS to find all B ≤ n vertices reachable from s. Initialize the heap to contain those B vertices and perform zero inserts.

• Option 3: memoization. This option memoizes the entire heap, saving it until the end of each invocation of S.distance(s,d), and restoring it the next time the function is called with the same source s. This requires replacing Ps with an array P[s] to hold separate heaps for each s. Since edges may have been added to S since the last time P[s] was saved, the heap may not correctly reflect distances in S when restored. Therefore, the restore step must apply the relax operation to every edge that was added in the interim. This restore step can be implemented without increasing the total asymptotic cost of the algorithm (see [24]). With this modification, the cost of initializing and inserting during each call to S.distance drops to zero, but the restore (relax) step must be performed for some number R of new edges in each call.

• Option 4: decrease-key vs. sifting. The location array finds z in the heap in constant time, which is clearly better than a linear-cost search for z. On the other hand, the location array must be updated every time a heap element is moved by a siftup or siftdown operation. Sifting occurs in the inner loops of the insert, extractMin, and decreaseKey operations, and updating the location array as well as the heap likely doubles the cost of sifting. If the number K of decrease-key operations is small compared to T, the number of elements sifted, it would be faster to omit the location array. Instead, decreaseKey could be implemented by placing a new copy of z in the heap. The extractMin would be modified to skip duplicate keys.
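Option 4 can be sketched with Python's `heapq`, which keeps no location array. In this illustration decrease_key simply pushes a second copy of the vertex with its smaller key, and extract_min discards any copy of a vertex that has already been extracted. The class name and the `done` set are assumptions of this sketch, not details from the book.

```python
import heapq

class LazyMinHeap:
    """Priority queue without a location array (a sketch of option 4)."""

    def __init__(self):
        self.heap = []      # (key, vertex) pairs, possibly with duplicates
        self.done = set()   # vertices already returned by extract_min

    def insert(self, v, k):
        heapq.heappush(self.heap, (k, v))

    # With duplicates allowed, decrease_key is just another insert:
    # the copy with the smaller key will surface first.
    decrease_key = insert

    def extract_min(self):
        while self.heap:
            k, v = heapq.heappop(self.heap)
            if v not in self.done:
                self.done.add(v)
                return v, k
            # Otherwise this is a stale duplicate; skip it.
        return None
```

The trade-off is that each decrease-key enlarges the heap by one entry, so this variant pays off only when decrease-keys are rare relative to sift steps.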

The right choice depends on the values of parameters I, B, R, K, and T and on the code costs of these various alternatives. Most of these options interact, suggesting another factorial design. But before writing 16 = 2^4 copies of the code, we use exploratory experiments to measure these parameters and identify the most promising options.

The first experiment modifies V2 to report the outcome (insert, reject) for each edge in the graph (with the outer loop abort turned off for the moment). From this experiment we learn that when ES is applied to a random uniform graph, it works in two phases. In phase 1, S grows rapidly because most edges are accepted as essential; in phase 2, when S is completed, all edges are rejected as nonessential.

These two phases produce distinct patterns of access to Ps and different conclusions about which implementation options are best. To take a concrete example, in one trial with n = 100 and m = 4950, phase 1 (building S) occupied the first 492 iterations. When S was completed, it contained 281 edges; that means that about 40 percent of edges were accepted during phase 1. Phase 2 (when all edges were nonessential) occupied the remaining 4457 iterations.
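An experiment of this kind can be approximated in a few lines. The sketch below assumes that ES examines edges in increasing order of cost and inserts an edge into S when the current distance in S between its endpoints exceeds the edge cost; that loop, the complete graph with uniform random costs, and all function names are assumptions made for illustration, and the counts it prints will not match the trial reported above.

```python
import heapq
import random
from itertools import combinations

def s_distance(adj, s, d, bound):
    """Dijkstra in S from s; return the s-d distance, or inf if it exceeds bound."""
    dist = {s: 0.0}
    heap = [(0.0, s)]
    while heap:
        du, u = heapq.heappop(heap)
        if du > dist[u]:
            continue  # stale entry
        if u == d:
            return du
        for v, w in adj[u]:
            dv = du + w
            if dv <= bound and dv < dist.get(v, float("inf")):
                dist[v] = dv
                heapq.heappush(heap, (dv, v))
    return float("inf")

def essential_subgraph(n, seed=0):
    """Run the assumed ES loop and record the outcome for every edge."""
    rng = random.Random(seed)
    edges = sorted((rng.random(), s, d) for s, d in combinations(range(n), 2))
    adj = [[] for _ in range(n)]
    outcomes = []  # "insert" or "reject", one entry per edge examined
    for c, s, d in edges:
        if s_distance(adj, s, d, c) > c:
            adj[s].append((d, c))
            adj[d].append((s, c))
            outcomes.append("insert")
        else:
            outcomes.append("reject")
    return outcomes

outcomes = essential_subgraph(100)
last_insert = max(i for i, o in enumerate(outcomes) if o == "insert")
print("edges examined:", len(outcomes))
print("edges inserted:", outcomes.count("insert"))
print("last insert at iteration:", last_insert + 1)
```

The position of the last insert marks the end of phase 1 in a run; everything after it is rejected, which is the access pattern the text attributes to phase 2.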

Here are some observations about the parameters in these two phases; the notations I1, I2 refer to parameter I in phases 1 and 2, respectively.

1. Options 1 and 2. In phase 1 the original implementation performs one initialization and I1 = 30.9 inserts per call on average. Option 1 would initialize a heap of size n = 100 and perform zero inserts. Option 2 would perform a BFS search on B1 = 51.2 vertices (on average), then initialize a heap of size B1, and then perform zero inserts. It is a close call: none of these options is likely to be significantly faster than another. In phase 2, however, option 1 is the clear winner: the original version performs I2 = 97.1 inserts on average, which must be slower than initializing a heap of size 100, and the BFS search in option 2 is superfluous since B = n.

2. Option 3. In phase 1 the original implementation performed I1 = 30.9 inserts and K1 = 4.8 decrease-keys per call on average, totaling 14,907.6 = 492 × 30.9 inserts and 2,361.6 = 492 × 4.8 decrease-keys. By comparison, option 3 would initialize a heap of size n = 100 and perform 0 inserts and no more than K1 decrease-keys per call: this option is clearly worth exploring. In phase 2, option 3 is the clear winner because the cost of the restore operation drops to zero, while the original version performs I2 = 97.1 inserts per invocation.

3. Option 4. In phase 1 the average number of decrease-keys per call to S.distance is K1 = 4.8, compared to T1 = 159.4 sift steps: savings would likely accrue from implementing option 4. In phase 2, K2 = 40.4 while T2 = 625.6, so the cost difference is smaller.
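The option 4 comparison can be put on a common scale by dividing decrease-keys by sift steps in each phase. This is only a restatement of the averages quoted above; the dictionary layout and function name are chosen here.

```python
# Per-call averages quoted above for the trial with n = 100.
phase1 = {"K": 4.8, "T": 159.4}
phase2 = {"K": 40.4, "T": 625.6}

def decrease_keys_per_sift(phase):
    """Decrease-key operations per element sifted."""
    return phase["K"] / phase["T"]

for name, phase in (("phase 1", phase1), ("phase 2", phase2)):
    ratio = decrease_keys_per_sift(phase)
    print(f"{name}: {100 * ratio:.1f} decrease-keys per 100 sift steps")
```

The ratio roughly doubles from phase 1 to phase 2, which is consistent with the remark that dropping the location array helps less in phase 2.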

With the outer loop abort test, V2 spends about 90 percent of its time in phase 1. The path is clear: option 3 should be explored next, and option 4 is also promising. These conclusions should be valid for random uniform graphs at larger values of n, but initialization might change the balance of some parameters at small n. If the ES algorithm is to be applied to other input classes, these tests should be rerun to check whether these basic relations still hold.
