Summary - Discovery of Potential Parallelism in Sequential Programs

We started our tour of data dependence analysis by reviewing the advantages and disadvantages of static and dynamic approaches. Static approaches are fast and necessary to enable advanced code optimization, but they are conservative with respect to dynamically allocated memory, pointers, and dynamically calculated array indices. Dynamic approaches, on the other hand, cover all dynamic memory instructions in one execution, but incur high runtime overhead in terms of both time and space.

In this thesis, we present the DiscoPoP profiler, a generic data-dependence profiler with practical overhead for both sequential and parallel programs. To achieve efficiency in time, the profiler is parallelized, taking advantage of a lock-free design. To achieve efficiency in space, the profiler leverages signatures, a concept borrowed from transactional memory. Both techniques are application-oblivious, which is why they do not restrict the profiler’s scope in any way. The profiler also produces the Program Execution Tree (PET) to support parallel pattern detection. Together with other optimization techniques such as variable lifetime analysis and dependence merging, the DiscoPoP profiler achieves an average slowdown of 86× for the NAS and Starbench benchmarks, with an average memory consumption of 1020 MB.
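The role of signatures can be illustrated with a minimal sketch. It is not the profiler's actual implementation: the class `Signature`, the function `profile`, the modulo hash, and the slot count are all assumptions made for illustration. The idea shown is that a fixed-size table, indexed by a hash of the memory address, remembers the source line of the last read and the last write, so memory use stays bounded regardless of how many addresses the program touches. The price is that two addresses mapping to the same slot can produce a spurious dependence.

```python
class Signature:
    """Fixed-size, approximate map from a memory address to the source
    line of its last access (hypothetical helper, not DiscoPoP's API)."""

    def __init__(self, slots):
        self.slots = [None] * slots

    def _index(self, addr):
        # Assumed hash function: plain modulo over the slot count.
        return addr % len(self.slots)

    def lookup(self, addr):
        return self.slots[self._index(addr)]

    def update(self, addr, line):
        self.slots[self._index(addr)] = line


def profile(trace, slots=1024):
    """trace: iterable of (op, addr, line) with op in {'R', 'W'}.
    Returns a set of (type, sink_line, source_line) dependences."""
    reads, writes = Signature(slots), Signature(slots)
    deps = set()
    for op, addr, line in trace:
        last_write = writes.lookup(addr)
        if op == "R":
            if last_write is not None:
                deps.add(("RAW", line, last_write))
            reads.update(addr, line)
        else:
            last_read = reads.lookup(addr)
            if last_read is not None:
                deps.add(("WAR", line, last_read))
            if last_write is not None:
                deps.add(("WAW", line, last_write))
            writes.update(addr, line)
            reads.update(addr, None)  # later writes no longer follow this read
    return deps


# One address: written at line 1, read at line 2, written again at line 3.
same_addr = profile([("W", 0x10, 1), ("R", 0x10, 2), ("W", 0x10, 3)])

# Two different addresses (0 and 4): independent with enough slots,
# but they collide in a 4-slot signature and yield a false RAW.
no_collision = profile([("W", 0, 1), ("R", 4, 2)], slots=1024)
collision = profile([("W", 0, 1), ("R", 4, 2)], slots=4)
```

In this sketch, shrinking `slots` lowers memory consumption and raises the chance of false positives, which is the trade-off a signature-based design has to balance.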

An aggressive optimization that skips memory instructions in loops lowers the time overhead of profiling further. Without any other optimization technique, skipping memory instructions in loops shortens the profiling time by 41.3% without incurring significant space overhead. Moreover, it provides interesting insights into the distributions of memory instructions and data dependences in the NAS and Starbench benchmarks.

3 Computational Units

Existing approaches limit the scope of their search for parallelism to predefined language constructs. For example, the method proposed in [63] is designed to find parallelism only between functions. Other approaches such as [10, 13, 87] are more flexible in that they consider multiple and, in principle, arbitrary construct types. Common to all of them, however, is the restriction that they can only answer questions of the following type: (i) Can a construct or region with given entry and exit points be parallelized? (ii) Can a construct with given entry and exit points run asynchronously with other parts of the program? Thus, their underlying strategy is to first identify the regions of investigation, usually following the structure of the programming language, and then reason about their parallelization.

In contrast to the classic methods, we try to cover parallelism that is not aligned with language constructs. This requires a new representation of a program in which the smallest unit does not contain any unexplored parallelism, and this unit need not be aligned with language constructs. We analyze dependences among such units to find parallelism, and it should also be possible to combine such units at any granularity, from fine to coarse. In this chapter, we define the computational unit (CU) to serve as this smallest unit. We show algorithms to construct CUs, as well as our new representation of program execution: the CU graph.

3.1 Definition

We define a new language-independent code-granularity level for both program analysis and reflection of parallelism, which we call computational units (CUs). A CU is the smallest unit of code we map onto a thread, that is, while potentially running in parallel to other CUs, a CU itself is not subject to any further (internal) parallelization—at least not within the scope of our method.

The notion of CUs was inspired by our earlier work [88], where a variation of this concept was applied to detect data races on correlated variables. In this thesis, a CU is a collection of instructions following the read-compute-write pattern: a set of variables is read by a collection of instructions and used to perform computation, then the result is written back to another set of variables. We call the two sets read set and write set, respectively. The two sets do not have to be disjoint. The load instructions reading the variables in the read set form the read phase of the CU, and the store instructions writing the variables in the write set form the write phase of the CU.
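The read-compute-write pattern can be made concrete with a small sketch. The function `cu_example` and the dictionary `state`, which stands in for variables global to the code section, are hypothetical names chosen for illustration. The read set is {x, y} and the write set is {y}, so the two sets overlap, which the definition permits.

```python
def cu_example(state):
    # Read phase: load every variable in the read set {x, y}.
    x = state["x"]
    y = state["y"]

    # Compute: work only on local temporaries.
    total = x + y
    result = total * 2

    # Write phase: store to every variable in the write set {y}.
    state["y"] = result
    return state


out = cu_example({"x": 1, "y": 2})
```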

Definition of a CU. Given a code section C, let GV_C be the set of variables that are global to C. Let I_x and O_x be the sets of instructions reading and writing variable x, respectively. C is a computational unit if it satisfies the following condition:

∀v ∈ GV_C : I_v → O_v.    (3.1)

“→” is the happens-before relationship [89]. Note that “→” is defined on a single variable. Read and write operations on two different variables can be executed in any order if there is no indirect data dependence. This does not conflict with the concept that a CU does not contain any unexplored parallelism: instruction-level parallelism is explored and automatically utilized by the hardware.
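Condition (3.1) can be checked mechanically on a sequence of accesses. The sketch below assumes one particular reading of the condition: for each global variable, every read of that variable precedes every write of it in program order. The function `is_computational_unit` and the `(op, var)` encoding of accesses are hypothetical, and indirect data dependences between different variables are not modeled.

```python
def is_computational_unit(accesses, global_vars):
    """accesses: list of (op, var) in program order, op in {'R', 'W'}.
    Returns True if no global variable is read after it was written."""
    written = set()
    for op, var in accesses:
        if var not in global_vars:
            continue  # local temporaries are not constrained by (3.1)
        if op == "W":
            written.add(var)
        elif var in written:
            return False  # a read of var follows a write of var
    return True


globals_c = {"x", "y"}

# Reads of x and y, then a write of y: satisfies the condition.
ok = is_computational_unit([("R", "x"), ("R", "y"), ("W", "y")], globals_c)

# x is read again after it was written: violates the condition.
bad = is_computational_unit([("R", "x"), ("W", "x"), ("R", "x")], globals_c)

# The relation is per variable: writing x before reading y is allowed.
mixed = is_computational_unit([("W", "x"), ("R", "y")], globals_c)

# The temporary t is local, so a write followed by a read of t is ignored.
local = is_computational_unit([("W", "t"), ("R", "t")], globals_c)
```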

Following the definition, the read phase and the write phase of a CU are ⋃_{v ∈ GV_C} I_v and ⋃_{v ∈ GV_C} O_v, respectively. When considering only the read phase and the write phase, a CU does not hide any internal true dependences (RAWs) that are essential to the data flow of the program, meaning all relevant parallelization opportunities can be analyzed on the level of CUs. Moreover, via control-flow analysis we ensure that CUs never cross the boundaries of a control region. CUs are small enough to express very fine-grained parallelism, typically covering no more than a few lines of code, and the control-region property ensures that they can be easily combined into higher-level constructs such as loops or functions. This allows the reflection of parallelism to be lifted to arbitrarily high levels of abstraction, making our approach general. Note that CUs never crossing control boundaries is not in conflict with the idea that CUs may not be aligned with language constructs: a CU may be part of a construct.
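As a rough illustration of how dependences can be analyzed at the level of CUs, the sketch below computes the read phase and the write phase of a CU as the sets of its load and store instructions on global variables, and then reports true dependences between two CUs by matching the variables written by the earlier one against those read by the later one. The names `Instr`, `phases`, and `raw_between` are hypothetical, and this is a simplification for illustration rather than the construction algorithm of the thesis.

```python
from typing import NamedTuple


class Instr(NamedTuple):
    line: int  # source line of the instruction
    op: str    # 'R' for a load, 'W' for a store
    var: str   # variable accessed


def phases(instrs, global_vars):
    """Return (read_phase, write_phase) of a CU as sets of instructions."""
    read_phase = {i for i in instrs if i.op == "R" and i.var in global_vars}
    write_phase = {i for i in instrs if i.op == "W" and i.var in global_vars}
    return read_phase, write_phase


def raw_between(earlier, later, global_vars):
    """RAW dependences (read_line, write_line) from `earlier` to `later`."""
    _, writes = phases(earlier, global_vars)
    reads, _ = phases(later, global_vars)
    return {(r.line, w.line) for w in writes for r in reads if w.var == r.var}


gv = {"x", "y", "z", "w"}
cu_a = [Instr(1, "R", "x"), Instr(2, "W", "y")]
cu_b = [Instr(3, "R", "y"), Instr(4, "W", "z")]
cu_c = [Instr(5, "R", "x"), Instr(6, "W", "w")]

a_to_b = raw_between(cu_a, cu_b, gv)  # cu_b reads y, which cu_a writes
a_to_c = raw_between(cu_a, cu_c, gv)  # no shared written/read variable
```

Here `cu_b` truly depends on `cu_a`, while `cu_a` and `cu_c` have no true dependence in either direction and are therefore candidates for running in parallel under this simplified model.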
