Full Race Detection Algorithm - Easier Parallel Programming with Provably-Efficient Runtime Sch

structured futures. In order to do so, we must consider the other aspect of race detection, namely the access history: for each memory location ℓ, the access history maintains enough information about the previous accesses to ℓ so that future accesses to ℓ can detect races.

For serial race detection of series-parallel programs, the access history for each memory location ℓ contains the last serial reader strand r and writer strand w for ℓ [71]. Whenever a strand s reads from a memory location ℓ, the race detector checks the reachability data structure to determine whether s is logically parallel with the last writer w; if so, a race is reported. Otherwise, the detector checks whether s is in series with the last reader r, and replaces r with s if so. Crucially, storing only the last serial reader suffices when the computation is series-parallel. A write by s is handled similarly, except that s is compared against both the last reader and the last writer. For parallel race detection of series-parallel programs, it suffices to maintain two readers and one writer [132]. In both cases, the access history stores a constant number of previous accesses, and each memory access leads to at most a constant number of queries into the reachability data structure.

This property no longer holds for programs with futures, however. In particular, the access history for memory location ℓ still holds only one writer strand, namely the most recent writer last-writer(ℓ). However, it must now store an arbitrarily large reader-list. Race detection proceeds as follows. Whenever a strand s reads from the memory location ℓ, the race detector checks the reachability data structure to determine whether s is logically parallel with last-writer(ℓ); if so, a race is reported. Otherwise, s is added to reader-list(ℓ). When a strand s writes to a memory location ℓ, the race detector must check s against all readers in reader-list(ℓ) and against last-writer(ℓ). If s is in parallel with any of them, then it declares a race. Otherwise, the reader list is set to empty and s is stored as last-writer(ℓ).

1. Currently executing strand s reads location ℓ: check if last-writer(ℓ) is in a P bag; if so, declare a race. Otherwise, append s to reader-list(ℓ).

2. Currently executing strand s writes to location ℓ: check if last-writer(ℓ) or any reader r ∈ reader-list(ℓ) is in a P bag; if so, declare a race. Otherwise, empty the reader list and set last-writer(ℓ) = s.

Figure 7.1: Pseudocode for memory accesses, using the extended SP-Bags algorithm to perform reachability queries.

We can empty the reader list without missing any races, because any strand that executes later and would be in parallel with one of these readers must also be in parallel with s (which is the new last-writer(ℓ)), and the race will be reported with s. Figure 7.1 shows the pseudocode for each memory access.
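The access-history logic described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the thesis implementation: the `parallel` callback stands in for the reachability query (the P-bag check), and the names `AccessHistory`, `read`, `write`, and `queries` are hypothetical.

```python
from typing import Callable, Dict, Hashable, List, Tuple

Strand = Hashable
Location = Hashable


class AccessHistory:
    """Per-location last writer plus an unbounded reader list (illustrative names)."""

    def __init__(self, parallel: Callable[[Strand, Strand], bool]) -> None:
        # parallel(u, s) is an assumed oracle: True if the earlier strand u is
        # logically parallel with the current strand s (the "P bag" check).
        self.parallel = parallel
        self.last_writer: Dict[Location, Strand] = {}
        self.reader_list: Dict[Location, List[Strand]] = {}
        self.races: List[Tuple[Location, Strand, Strand]] = []
        self.queries = 0  # number of reachability queries issued so far

    def _is_parallel(self, earlier: Strand, current: Strand) -> bool:
        self.queries += 1
        return self.parallel(earlier, current)

    def read(self, s: Strand, loc: Location) -> None:
        w = self.last_writer.get(loc)
        if w is not None and self._is_parallel(w, s):
            self.races.append((loc, w, s))  # write-read race
            return
        self.reader_list.setdefault(loc, []).append(s)

    def write(self, s: Strand, loc: Location) -> None:
        # A writer is checked against every recorded reader and the last writer.
        earlier = list(self.reader_list.get(loc, []))
        w = self.last_writer.get(loc)
        if w is not None:
            earlier.append(w)
        racy = [u for u in earlier if self._is_parallel(u, s)]
        if racy:
            self.races.extend((loc, u, s) for u in racy)
            return
        # No race: s supersedes every recorded reader and the old writer.
        self.reader_list[loc] = []
        self.last_writer[loc] = s
```

In a race-free execution of this sketch, each read issues at most one query and is checked at most once more by the next write to the same location, and each write issues at most one extra query against the last writer, so `queries` never exceeds 2 × (number of reads) + (number of writes).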

Unlike series-parallel computations, each write may generate multiple queries. However, we can bound the total number of queries since each writer removes the entire reader-list, yielding the following theorem.

Theorem 7.7. The total running time of race detection for structured single-touch future-parallel programs is O(T_1 α(m, n)), where α is the inverse Ackermann's function, m is the number of memory accesses, and n is the number of spawn or create future calls in the program.

Proof. The fast disjoint-sets data structure provides an amortized bound of O(α(m, n)) time per operation, where m is the number of operations and n is the number of sets. For our program, m is at most the number of memory accesses and n is the number of strands in the program. Clearly we do at most T_1 Make-Set and Union operations, so we only need to bound the number of queries.

We only do queries at memory accesses. On each read, we do one query, checking the reachability between the last writer and the current strand. On each write, we may do many queries, one against each of the readers in the reader list. However, all the readers in the reader-list are removed at a write. Therefore, each read leads to at most two queries, one when the read itself occurs and another when a subsequent write to the same memory location occurs. So the total number of queries is bounded by 2 × number of reads. Since the total number of reads is at most T_1, the total number of disjoint-sets operations is O(T_1).

The number of sets is the total number of places at which parallelism is created, i.e., the number of spawn or create future calls in the program. Thus the total cost of race detection is O(T_1 α(T_1, n)).
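For readers unfamiliar with the disjoint-sets structure the proof relies on, the following is a minimal Python sketch using union by rank and path compression, the standard combination that gives the amortized O(α(m, n)) bound per operation. The class and method names are illustrative; in an SP-Bags-style detector each bag would be one set, and each reachability query would be a `find`.

```python
class DisjointSets:
    """Union-find with union by rank and path compression (illustrative sketch)."""

    def __init__(self):
        self.parent = {}
        self.rank = {}

    def make_set(self, x):
        # Create a singleton set containing x, if x is not already known.
        if x not in self.parent:
            self.parent[x] = x
            self.rank[x] = 0

    def find(self, x):
        # First pass: walk up to the representative of x's set.
        root = x
        while self.parent[root] != root:
            root = self.parent[root]
        # Second pass: point every node on the path directly at the root.
        while self.parent[x] != root:
            self.parent[x], x = root, self.parent[x]
        return root

    def union(self, a, b):
        # Merge the sets containing a and b; return the new representative.
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return ra
        if self.rank[ra] < self.rank[rb]:
            ra, rb = rb, ra
        self.parent[rb] = ra  # attach the shallower tree under the deeper one
        if self.rank[ra] == self.rank[rb]:
            self.rank[ra] += 1
        return ra
```

With this structure, the O(T_1) Make-Set, Union, and Find operations counted in the proof each cost amortized O(α) time.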

7.4 Related Work

As discussed in section 4.5, there is a large body of work on race detection for fork/join programs. Other structured computations have also been considered: Dimitrov et al. [60] propose an algorithm for race detection on computations that look like grids, while Lee and Schardl [119] propose a race detector for fork-join computations that use a special kind of reduction mechanism. Recently, Surendran and Sarkar [173] proposed the first race detection algorithm for programs that use futures. Their reachability data structure has significantly more overhead than ours, however; in particular, the running time increases quadratically with the number of futures (that is, multiplicatively rather than additively as in our case). There are two important distinctions between our approaches. First, their reachability data structure does not encode paths that include both SP and non-SP edges. Therefore, to answer a single reachability question of whether u ≺ v, they must make multiple queries to the reachability data structure. Second, their reachability data structure explicitly stores a dag and each reachability query does a search on the dag; therefore, each query to the reachability data structure can take more than constant time.

In addition to race-detection for programs with structured parallelism and futures, there is a rich literature on dynamic race detection for programming models that generate computations with nondeterministic dependence structures, such as ones that involve locks [164, 45, 153, 47, 141, 191, 152, 72, 69, 55]. For such models, since the output necessarily depends on the schedule, the best correctness guarantee that a race detector can provide is for a given program, for a given input, and for a given schedule. Also, these race detectors are often based on a persistent threading model; applying them to dynamically multithreaded programs means tracking all memory accesses of the work-stealing scheduler, resulting in high overhead and reporting of benign races in the runtime system.

7.5 Conclusions and Future Work

This chapter presented a race-detection algorithm for programs that use futures in a structured way. This structured use of futures still admits many useful applications [92], but allows our race detection algorithm to be very efficient. The overhead of our algorithm is proportional to the inverse Ackermann's function, which is so slow-growing that for all intents and purposes the bound is asymptotically optimal.

One obvious avenue for future work is to implement this algorithm and evaluate it, using the runtime from chapter 5 to implement futures. Also, our algorithm only works when the computation is executed serially; it would be interesting to consider how to parallelize the algorithm. There may also be larger classes of restricted futures that still admit efficient race detection algorithms. Chapter 8 discusses race detection for arbitrary use of futures.

Chapter 8 Efficient Race Detection for General Future-Parallel Computations

The previous chapter presented an efficient (nearly asymptotically-optimal) algorithm for detecting races in computations that use futures in a restricted way. In this chapter we consider race detection on general future-parallel computations, placing no restrictions on how futures can be used. This generality comes at a cost, though we argue that this cost is reasonable for many programs.

General use of futures can generate computation DAGs with arbitrary dependencies. Therefore it seems unlikely that we can provide race detection without some asymptotic overhead. We present an algorithm whose running time depends on the number of get future operations performed in the computation. In particular, we reduce the race-detection problem to doing on-the-fly dynamic reachability queries on a series-parallel dag with extra edges due to get future. Our algorithm runs in O(T_1 + k^2) time, where k is the number of get future operations.

As noted in chapter 7, create future and get future are strictly more general than spawn and sync, since we can replace each spawn with a create future and each sync with a get future on each of the futures created so far. However, we do not do so in the model for this chapter, since replacing sync operations with get future will increase k. Separating the two types of parallel primitives is in fact a major advantage of our model.
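The replacement of spawn and sync by futures can be illustrated by analogy with Python's `concurrent.futures`. This is only a sketch, not the runtime discussed in this thesis: `pool.submit` plays the role of create future, `Future.result()` plays the role of get future, and the names `square` and `run_with_futures` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def square(x):
    return x * x


def run_with_futures(inputs):
    """Emulate 'spawn each task, then sync' using only futures.

    Returns the results and the number of get operations performed.
    """
    get_count = 0
    with ThreadPoolExecutor(max_workers=4) as pool:
        # Each spawn becomes a create-future operation.
        created = [pool.submit(square, x) for x in inputs]
        # The single sync becomes one get per future created so far.
        results = []
        for future in created:
            results.append(future.result())
            get_count += 1
    return results, get_count
```

For example, `run_with_futures([1, 2, 3])` performs three get operations where the spawn/sync version would need a single sync, which is why encoding sync this way inflates k.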

This dynamic reachability algorithm can be extended with an access history component using the same approach as chapter 7, resulting in a full determinacy race detection algorithm that runs in time O(T_1 + k^2).

Outline. This chapter is structured as follows. Section 8.1 explains how we can utilize our knowledge of series-parallel DAGs to model general DAGs, while section 8.2 builds on this to explain an offline reachability algorithm. We extend this to an online algorithm in section 8.3. The algorithm is proven correct in section 8.4 and the performance of the full race detection algorithm is analyzed in section 8.5. Sections 8.6 and 8.7 discuss related work and conclude, respectively.

8.1 “Nearly” Series-Parallel DAGs

We model future-parallel computations as a series-parallel DAG with some non-SP edges. The create future primitive creates a sub-SP-dag which is ended by the get future primitive. We add artificial SP edges from the computation’s root to each future task’s start node, and from each future task’s end node to the computation sink. These edges are for modeling purposes only, allowing us to support future tasks that create and get other futures. These additional edges do not change the meaning of the DAG and are less restrictive than create/get edges.

Note that the root of the computation may now have outdegree greater than two. As in previous chapters, we would like to assume binary forking and spawning. So we remedy this by adding a binary tree of height log k to the start of the computation, where k is the number of future tasks in the computation.
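One way to picture the binary tree is the following Python sketch, which connects a root to k target nodes through dummy nodes so that every node has out-degree at most two. It is an assumption-laden illustration rather than the construction used in the thesis: the function name `binary_fanout` and the `("fanout", i)` dummy labels are hypothetical, and `targets` is assumed to list every node the root must reach.

```python
from itertools import count


def binary_fanout(root, targets):
    """Return (edges, height) linking root to all targets with out-degree <= 2."""
    targets = list(targets)
    edges = []
    fresh = count()  # generates ids for the dummy internal nodes

    def attach(parent, group):
        # Up to two targets can hang directly off the parent.
        if len(group) <= 2:
            edges.extend((parent, t) for t in group)
            return 1
        # Otherwise split the group in half under two new dummy nodes.
        mid = (len(group) + 1) // 2
        height = 0
        for half in (group[:mid], group[mid:]):
            dummy = ("fanout", next(fresh))
            edges.append((parent, dummy))
            height = max(height, attach(dummy, half))
        return 1 + height

    height = attach(root, targets) if targets else 0
    return edges, height
```

For k ≥ 2 targets, the longest path from the root to a target in this sketch has ⌈log2 k⌉ edges, matching the log k height mentioned above.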

This model of series-parallel computation dags with arbitrary non-SP edges is quite general and subsumes computations that can arise from future [127, 40, 73, 43, 42, 110, 9, 86] or other future-like (such as “put” and “get” [37, 180]) parallel constructs that we encountered in the literature. Therefore, our algorithm would work on all of these primitives.

We need to answer reachability queries in such graphs, where O(k) arbitrary non-SP edges have been added. We will show how to build a data structure in time O(n + k^2), where n is the total number of vertices.

Formally, we consider reachability on a graph G = (V, E_SP ∪ E_non). The minor G_SP = (V, E_SP) contains all the series-parallel edges, while E_non contains the non-SP edges. We assume non-SP edges are not incident on fork or join nodes and that the computation’s source node is not a fork node. We also assume that the graph description specifies which edges are part of E_SP and which are the extra edges of E_non. This information is provided in a programming model with different linguistic keywords for different edge types: spawn and sync on the one hand, and create future and get future on the other. This has a minor disadvantage for the programmer, who must now remember more keywords. However, we will see that it allows our reachability data structure to be much more efficient.

We write u ≺ v to denote the presence of a directed path from u to v in the graph in question. We say that u is a predecessor of v and v is a successor of u. To disambiguate, we often indicate the graph being discussed as a subscript of the precedes symbol, e.g., u ≺_G v to mean that there is a path from u to v in G. (The path can be empty, i.e., we always have v ≺ v.) We use u ≺_SP v as a shorthand for u ≺_{G_SP} v, i.e., to indicate that there is a path from u to v using only the edges E_SP. We use u ⇝ v to refer to a (possibly empty) directed path from u to v, and we use u ⇝_G v to mean a path in G. As before, we use u ⇝_SP v as a shorthand for u ⇝_{G_SP} v.
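To make the two precedence relations concrete, here is a naive Python baseline that stores E_SP and E_non separately and answers each query with a depth-first search. It is emphatically not the data structure developed in this chapter (each query here costs time linear in the graph size); the class name `NearlySPDag` and its methods are hypothetical.

```python
from collections import defaultdict


class NearlySPDag:
    """Graph G = (V, E_SP ∪ E_non) with naive search-based reachability."""

    def __init__(self, sp_edges, non_sp_edges):
        self.sp = defaultdict(list)    # adjacency using E_SP only
        self.full = defaultdict(list)  # adjacency using E_SP and E_non
        for u, v in sp_edges:
            self.sp[u].append(v)
            self.full[u].append(v)
        for u, v in non_sp_edges:
            self.full[u].append(v)

    @staticmethod
    def _reaches(adj, u, v):
        # Iterative DFS; the empty path counts, so u reaches itself.
        stack, seen = [u], {u}
        while stack:
            x = stack.pop()
            if x == v:
                return True
            for y in adj.get(x, ()):
                if y not in seen:
                    seen.add(y)
                    stack.append(y)
        return False

    def precedes_G(self, u, v):
        """u ≺_G v: a path exists using any edges."""
        return self._reaches(self.full, u, v)

    def precedes_SP(self, u, v):
        """u ≺_SP v: a path exists using only E_SP."""
        return self._reaches(self.sp, u, v)
```

For instance, take SP edges r→s, s→a, s→b, a→t, b→t and a single non-SP edge a→b. Then a ≺_G b holds while a ≺_SP b does not, since a and b are parallel branches in G_SP.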

Consider a node v. If x and y are both predecessors of v and x ≺ y, then we say that y is a nearer predecessor of v. Similarly, if x and y are both successors of v, and x ≺ y, then we say that x is a nearer successor of v. When we use the term “nearer” in this section, we always mean with respect to the series-parallel graph G_SP.
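The definition of “nearer” can be checked mechanically on small examples. The sketch below is a naive illustration only: it assumes G_SP is given as an adjacency dictionary, uses a plain depth-first search for ≺_SP, and the function names are hypothetical.

```python
def reaches(adj, u, v):
    """True if there is a (possibly empty) directed path from u to v."""
    stack, seen = [u], {u}
    while stack:
        x = stack.pop()
        if x == v:
            return True
        for y in adj.get(x, ()):
            if y not in seen:
                seen.add(y)
                stack.append(y)
    return False


def nearer_predecessor(adj_sp, v, x, y):
    """Return the nearer of two predecessors of v, or None if incomparable."""
    if not (reaches(adj_sp, x, v) and reaches(adj_sp, y, v)):
        raise ValueError("x and y must both be predecessors of v")
    if reaches(adj_sp, x, y):
        return y  # x precedes y, so y is nearer to v
    if reaches(adj_sp, y, x):
        return x
    return None


def nearer_successor(adj_sp, v, x, y):
    """Return the nearer of two successors of v, or None if incomparable."""
    if not (reaches(adj_sp, v, x) and reaches(adj_sp, v, y)):
        raise ValueError("x and y must both be successors of v")
    if reaches(adj_sp, x, y):
        return x  # x precedes y, so x is nearer to v
    if reaches(adj_sp, y, x):
        return y
    return None
```

Two parallel predecessors (or successors) are incomparable under ≺_SP, in which case neither is nearer and the sketch returns `None`.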
