
8.5 Performance Analysis

8.5.1 Full Race Detection

We have shown the efficiency of the dynamic reachability algorithm. It can be extended to a full race detection algorithm by adding an access history component. Note, again, that unlike in series-parallel computations, each write may generate multiple queries; however, for the same reason as explained in section 7.3, the total number of queries is bounded by twice the total number of reads, since each writer removes the entire reader-list. Therefore, the total running time for race detection is $O(T_1 + k^2)$.
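The query-counting argument above can be illustrated with a minimal sketch, assuming an access history that keeps one last writer and one reader-list per memory location. The names `AccessHistory`, `make_dfs_oracle`, and `reaches` are hypothetical, and the depth-first search is only a stand-in for the chapter's dynamic reachability structure. The sketch counts reader-related queries separately from writer-writer checks (one per write); each read is queried once when it occurs and at most once more when the next writer clears the reader-list.

```python
from collections import defaultdict


def make_dfs_oracle(edges):
    """Stand-in reachability oracle (plain DFS), NOT the chapter's algorithm."""
    succ = defaultdict(list)
    for u, v in edges:
        succ[u].append(v)

    def reaches(u, v):
        stack, seen = [u], {u}
        while stack:
            x = stack.pop()
            if x == v:
                return True
            for y in succ[x]:
                if y not in seen:
                    seen.add(y)
                    stack.append(y)
        return False

    return reaches


class AccessHistory:
    """Hypothetical access history: last writer plus reader-list per location."""

    def __init__(self, reaches):
        self.reaches = reaches            # reaches(u, v): True if u precedes v
        self.last_writer = {}
        self.readers = defaultdict(list)
        self.reads = 0
        self.read_queries = 0             # queries with a reader as one endpoint
        self.write_queries = 0            # writer-writer queries (one per write)
        self.races = []

    def read(self, loc, node):
        self.reads += 1
        writer = self.last_writer.get(loc)
        if writer is not None:
            self.read_queries += 1
            if not self.reaches(writer, node):
                self.races.append((loc, writer, node))
        self.readers[loc].append(node)

    def write(self, loc, node):
        writer = self.last_writer.get(loc)
        if writer is not None:
            self.write_queries += 1
            if not self.reaches(writer, node):
                self.races.append((loc, writer, node))
        for reader in self.readers[loc]:  # a write may generate several queries
            self.read_queries += 1
            if not self.reaches(reader, node):
                self.races.append((loc, reader, node))
        self.readers[loc] = []            # the writer removes the entire reader-list
        self.last_writer[loc] = node


# Diamond-shaped DAG: a precedes b and c, which both precede d; b and c are parallel.
oracle = make_dfs_oracle([("a", "b"), ("a", "c"), ("b", "d"), ("c", "d")])
history = AccessHistory(oracle)
history.write("x", "a")
history.read("x", "b")
history.read("x", "c")
history.write("x", "d")   # ordered after both readers: no race on x
history.write("y", "b")
history.read("y", "c")    # b and c are parallel: race on y
```

In this run the only reported race is on `y`, and the number of reader-related queries stays within twice the number of reads.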

8.6 Related Work

Related to, but easier than, our problem is the static problem of building a reachability oracle for a directed graph. In the case where the base graph is an n-node directed tree and k non-tree edges are added, Wang et al. [187] provide an algorithm that requires $O(n + k^2)$ space, $O(n + k^3)$ construction time, and answers queries in $O(1)$ time. Our algorithm achieves the same space, a better construction time, and works for more general series-parallel graphs. Moreover, our algorithm is incremental, supporting edge insertions, whereas their algorithm is offline. Also related is the problem of labeling each vertex (offline) such that queries can be answered by simply comparing vertex labels. The best practical algorithm we are aware of uses 2-hop labels [49], but its construction time is polynomial, and there is no nontrivial bound known for the label size and hence query time once arbitrary edges are included in the graph.
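To make the idea of answering queries by comparing vertex labels concrete, here is a minimal sketch of a 2-hop style labeling. It is an illustration under stated assumptions, not the construction of [49]: each vertex gets an out-label and an in-label (sets of hub vertices), and u reaches v exactly when the two sets share a hub. The function name `build_two_hop_labels`, the pruned breadth-first construction, and the hub order are all choices made for this example.

```python
from collections import defaultdict, deque


def build_two_hop_labels(vertices, edges):
    """Build illustrative 2-hop labels; `vertices` also fixes the hub order."""
    succ, pred = defaultdict(list), defaultdict(list)
    for u, v in edges:
        succ[u].append(v)
        pred[v].append(u)
    l_out = {v: set() for v in vertices}   # hubs that v is known to reach
    l_in = {v: set() for v in vertices}    # hubs known to reach v

    def query(u, v):
        # u reaches v iff some hub lies on a path from u to v
        return not l_out[u].isdisjoint(l_in[v])

    def sweep(hub, neighbors, labels, covered):
        # BFS from hub; skip vertices whose relation to hub is already covered
        seen, queue = {hub}, deque([hub])
        while queue:
            x = queue.popleft()
            if x != hub and covered(x):
                continue
            labels[x].add(hub)
            for y in neighbors[x]:
                if y not in seen:
                    seen.add(y)
                    queue.append(y)

    for hub in vertices:
        sweep(hub, succ, l_in, lambda x: query(hub, x))    # forward pass
        sweep(hub, pred, l_out, lambda x: query(x, hub))   # backward pass
    return l_out, l_in, query


vertices = ["a", "b", "c", "d", "e"]
edges = [("a", "b"), ("a", "c"), ("b", "d"), ("c", "d"), ("d", "e")]
l_out, l_in, reachable = build_two_hop_labels(vertices, edges)
```

A query costs time proportional to the label sizes, which is why the lack of a nontrivial bound on label size translates directly into a lack of a bound on query time.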

Related work for race detection for fork/join programs is discussed in section 4.5. Race detection for other models is mentioned in section 7.4. To our knowledge, only one algorithm exists for determinacy race detection for dynamic multithreading with futures. Surendran and Sarkar [174] presented an online, serial algorithm for this problem. Given a program that normally executes in $T_1$ time on a single processor, that algorithm executes in time

$O(T_1 (k + 1)(n + 1)\,\alpha(T_1, s + k))$

and space

$O(s + k + n + v(k + 1))$

where k is the number of futures created, s is the number of spawned tasks, n is the number of future touches, v is the number of shared memory locations, and $\alpha$ is the inverse Ackermann function.

8.7 Conclusions and Future Work

This chapter presented a race-detection algorithm for general future-parallel programs that

pipeline parallelism, a weaker but very common expression of parallelism. It is an open question whether a more specialized, and possibly more efficient, technique exists for race detection in pipeline-parallel DAGs.

Note that if one is not careful, a program with unstructured futures can deadlock. Such a deadlock is deterministic, however, and does not depend on the schedule. In such cases, our algorithm detects races until the execution deadlocks.

The most glaring opportunity for future work is an implementation of this algorithm and a comparison with Surendran and Sarkar's [174] algorithm. The algorithm presented here would be rather complicated to implement, requiring that we maintain the entire computation graph in memory in order to perform the graph searches. It would be interesting to consider similar algorithms that may be able to avoid the graph searches. There may also be different algorithms which reduce the overhead of the $k^2$ term.

In addition, our current algorithm executes a computation serially. In future work we hope to develop a parallel algorithm for race detection, either as an extension to the presented algorithm or via a new algorithm.

Chapter 9

Conclusions and Future Directions

It is increasingly important to write software that utilizes modern multicore hardware. To offset the new challenges that parallel programming brings, we will need different techniques and tools for programming such hardware. Concurrency platforms like Cilk Plus [99] have decreased the difficulty of parallel programming, but more can be done. This dissertation has explored how runtime systems can be designed to support better tools to improve parallel programming.

Specifically, we have explored the design space around allowing Cilk workers to have multiple deques, considering the consequences of each design point to dynamic tools to ease parallel programming. The result is a set of runtime systems and tools which make parallel programming a more pleasant experience.

Chapter 3 presented Batcher, a runtime scheduler that automatically collects concurrent data structure operations into groups and invokes operations on a batched data structure, efficiently scheduling work on this data structure and on the core computation. The result allows operations on batched data structures to be treated essentially as sequential data structure operations, and comes with a theoretical performance guarantee.

A specialization of that runtime scheduler is provided in chapter 4, yielding a tool that automatically detects determinacy races in Cilk programs. By leveraging custom runtime support, CRacer runs efficiently and in parallel, reducing the time for the typical develop-test-debug feedback loop.

Chapter 5 covers an extension to the Cilk Plus runtime system that allows non-blocking suspension and resumption at arbitrary points. This can be used to support more advanced synchronization than is available in Cilk (fork/join programs), as well as to hide the latency of external events, such as network operations or I/O.

PORRidge, a debugging tool for non-deterministic parallel programs, is designed and examined in chapter 6. While existing record-and-replay tools target persistent thread models, PORRidge leverages the structure of fork/join dynamically multithreaded programs to provide process-oblivious record and replay of lock acquisitions in Cilk programs. PORRidge comes with a nearly-optimal performance bound and has been shown to perform well in practice.

Finally, chapters 7 and 8 extend the work in chapter 4 with the goal of detecting races for broader classes of computations. In particular, they consider the use of futures, which can be used to generate arbitrary computation DAGs. A useful but restricted use of futures is considered in chapter 7, while chapter 8 handles any use of futures. In both cases we design efficient race detection systems for such computations. Work on implementing these systems, using the runtime from chapter 5 to bring futures into the purview of Cilk's work stealing scheduler, is in progress.

In fact, none of these runtime systems contain mutually exclusive features. With some engineering effort, these could be merged into a single runtime system that may serve as an important platform for future research.

9.1 Future Work

Of course, the investigations in this dissertation represent only a sample of this research space. As Matt Might describes [133], a dissertation is just a small dent in the boundaries of human knowledge. In addition to the many ways each individual project may be improved (engineering improvements, improved performance bounds, etc.), there are many directions not addressed by this dissertation.

Though we have mostly considered how runtime support can improve tools for parallel programming, there are many levels in the software development stack that might be reconsidered in the context of parallel runtime systems. For example, Lee et al. [120] explore how an extension to Linux's virtual memory system can be used to improve the efficiency and interoperability of the Cilk Plus runtime system.

This dissertation has focused only on shared-memory homogeneous systems. With the popularity of multicore architectures we have also seen fresh exploration of heterogeneous hardware designs, where different cores may have different performance characteristics. These parallel machines can be particularly difficult to program, but some of the burden might be shifted to the runtime system if it can hide this heterogeneity and provide good performance. The same can be said for distributed systems, where communication between subsystems must be an integral consideration of any scheduler.

Perhaps the largest gap in our understanding of this area pertains to locality of memory use. Even in sequential systems there is a large gap between memory latency and clock speeds, leading to decreasing processor utilization over the years. Memory latency is often worse in parallel machines, where some memory may be closer to some cores than others and thus transfer times may differ depending on location. Runtime systems may find increasing use in automatically handling some memory transfers, reducing complexity in much the same way that runtime garbage collectors reduce the complexity of manual memory management. Further, though there is plenty of work on minimizing cache misses for sequential algorithms (e.g. [18, 78]), understanding cache misses in the context of dynamic scheduling is a harder problem [92, 44, 2]. In particular, it is unclear how a runtime scheduler might strike the right balance between load balance and locality of reference.