Parallelism Extraction from Sequential Code

Table 3.1: Comparison of different approaches for runtime management.

| Approach | Domain | Implementation | Restriction | Efficiency | Flexibility | Scalability | Heterogeneity |
|---|---|---|---|---|---|---|---|
| Hankins [99] (MISP) | HPC | HW/SW | ISA | ✓ | ✓ | ✓ | ✓ |
| Kumar [145] (Carbon) | HPC | HW | ISA | ✓✓ | ✗ | ✓ | ✗ |
| Park [196] (HOSK) | Embedded | HW | Dedicated | ✓✓ | ✗ | ✗ | ✗ |
| Seidel [218], Limberg [165] (CoreManager) | Embedded | HW | Dedicated | ✓✓✓ | ✗ | ✗ | ✓ |
| Arnold [10] (CoreManager) | Embedded | SW | – | ✓ | ✓ | ✓ | ✓ |
| Lippett [166] (SystemWeaver) | Embedded | HW | – | ✓✓ | ✗ | ✓ | ✓ |
| OSIP | Embedded | HW/SW | – | ✓✓ | ✓ | ✓ | ✓ |

[...] efficiency and flexibility. The fourth column (Restriction) refers to constraints that the approach imposes on the whole hardware platform. The first two approaches require a special ISA, while the next two pure hardware solutions (HOSK and CoreManager) require dedicated interfaces and PEs. The last three approaches, instead, support standard hardware interfaces and can therefore be interfaced with off-the-shelf cores.

The last row refers to the runtime manager proposed in this thesis. As will be seen in Chapter 4, OSIP is an ASIP that provides special instructions for task mapping, scheduling and synchronization. As is common for ASIP approaches, OSIP represents a tradeoff between pure software and pure hardware solutions: it attains a performance close to that of hardware implementations while retaining the flexibility of software.

3.2 Parallelism Extraction from Sequential Code

Automatic parallelism extraction from sequential specifications is one of the most sought-after compiler features. This has been the case ever since the single-core era, during which a great deal of effort was invested in the extraction of fine-grained ILP. An overview of automatic parallelization techniques can be found in [16, 80, 133, 148, 159]. Several of these research results materialized in frameworks such as the Stanford University Intermediate Format (SUIF) compiler framework [277] and the Open64 compiler [95]. Today, in the multicore era, extracting parallelism automatically has become even more important for two reasons. On the one hand, humans are used to thinking, and therefore programming, in a sequential manner. On the other hand, years of sequential programming have left industry with billions of lines of sequential code, which cannot simply be thrown away.

Initial works on coarse-grained parallelism extraction were based on traditional static compiler analysis [92, 93, 98, 216]. Researchers soon realized that static techniques, i.e., techniques based solely on code analysis, were not enough to uncover parallelism (see, for example, the discussion in [142]). As a result, several works based on dynamic analysis or speculative execution began to appear around the mid-2000s, both in the embedded and the desktop/HPC domains. Most of the works focus on exploiting TLP within loops (PLP), with less emphasis on coarse-grained DLP.
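The limitation of purely static analysis can be illustrated with a minimal sketch. It assumes a hypothetical kernel `scale_into` and a hypothetical helper `ranges_disjoint`; neither is taken from the cited works. From the code alone, a compiler cannot tell whether `dst` and `src` overlap, so it has to assume a loop-carried dependence. A dynamic analysis can instead observe, for the inputs actually used, that the two ranges are disjoint.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical kernel: dst[i] = 2 * src[i].
 * Without further information, static analysis must assume that dst and
 * src may overlap, i.e., that iteration i may read a value written by an
 * earlier iteration. */
void scale_into(int *dst, const int *src, size_t n)
{
    assert(dst != NULL && src != NULL);
    for (size_t i = 0; i < n; ++i)
        dst[i] = 2 * src[i];
}

/* Runtime check of the kind a profiling run could record:
 * returns 1 if the two n-element ranges do not overlap, 0 otherwise. */
int ranges_disjoint(const int *a, const int *b, size_t n)
{
    uintptr_t pa = (uintptr_t)a;
    uintptr_t pb = (uintptr_t)b;
    uintptr_t bytes = (uintptr_t)(n * sizeof(int));
    return pa + bytes <= pb || pb + bytes <= pa;
}
```

When the ranges are disjoint, the iterations are independent and could run in parallel. When `dst` is `src + 1`, each iteration consumes the result of the previous one and the loop is inherently sequential, although the source code is identical in both cases.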

46 Chapter 3. Related Work

3.2.1 General-Purpose Domain

Works in the desktop computing and HPC domains are characterized by the assumption of a homogeneous target platform and the sole goal of reducing the application makespan. Ottoni et al. [194] proposed automatic pipeline extraction with the so-called Decoupled Software Pipelining (DSWP) algorithm. The algorithm performs clustering on the application graph and uses a pipeline balancing strategy to improve throughput. This extraction is done at the granularity of instructions and is thus expensive in the absence of specialized hardware. Thies et al. [251] presented an approach to exploit pipelined parallelism from C applications with user annotations that serve to define a functional pipeline. The compiler then uses dynamic analysis to identify the data flowing between pipeline stages. Bridges et al. [30] described an extension to DSWP that supports Thread Level Speculation (TLS), together with extensions to the C language that make it possible to express, for example, that the order in which two functions are called is irrelevant. The approach followed by Rul et al. [214] is similar to the one presented in this thesis. They employ a combination of static and dynamic data flow analyses to obtain a graph representation of the application in which parallelism patterns are searched for. A different approach is followed by Tournavitis et al. [253], who applied machine learning techniques to decide whether or not to parallelize a loop and which OpenMP [247] scheduling policy to use. This approach applies to OpenMP-enabled platforms, which are uncommon in the embedded domain.
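To make the scheduling decision concrete, the following sketch shows two loops annotated with different OpenMP `schedule` clauses. The function names are hypothetical and the clauses are chosen by hand for illustration; they are not the output of the learned model in [253]. The code also compiles and runs sequentially when OpenMP support is disabled, since the pragmas are then ignored.

```c
#include <assert.h>
#include <stddef.h>

/* Uniform work per iteration: a static schedule, which splits the
 * iteration range into equal chunks, is a natural choice. */
void vec_add(const double *a, const double *b, double *c, size_t n)
{
    assert(a != NULL && b != NULL && c != NULL);
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < (long)n; ++i)
        c[i] = a[i] + b[i];
}

/* Work grows with i (triangular iteration space): a dynamic schedule
 * hands out small chunks on demand and balances the load better. */
void prefix_sums_naive(const double *a, double *out, size_t n)
{
    assert(a != NULL && out != NULL);
    #pragma omp parallel for schedule(dynamic, 16)
    for (long i = 0; i < (long)n; ++i) {
        double s = 0.0;
        for (long j = 0; j <= i; ++j)
            s += a[j];
        out[i] = s;   /* each iteration writes a distinct element */
    }
}
```

Both loops are safe to parallelize because every iteration writes a distinct output element. Which policy pays off depends on the workload and the platform, which is the kind of decision the cited work delegates to a trained model.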

3.2.2 Embedded Domain

As opposed to the software-driven nature of HPC approaches, early works in the embedded domain appear to be inspired by High-Level Synthesis (HLS) techniques (see for example [57]). They are characterized by resource-constraint considerations and a target timing to be met. As an example, Karkowski et al. [128] described a methodology to map a loop to a pipeline of ASIPs. Instead of assuming existing hardware, this approach could be used to synthesize an application-specific SoC.

After initial attempts, the results of parallelizing compilers were unsatisfactory. While programmers in the desktop domain are satisfied with a good application speedup, this is usually not the case in the embedded domain. Embedded programmers commonly require a near-optimal parallelization, i.e., one that achieves the desired execution time with the least energy consumption. For that reason, embedded software companies are used to allocating more time to application development. Embedded application experts, as opposed to general-purpose programmers, are accustomed to manually tweaking code for irregular architectures (DSPs and ASIPs). With this background in mind, some works focused on helping designers improve their code rather than directly performing automatic parallelism extraction. Researchers at the Interuniversity MicroElectronics Centre (IMEC) designed an analysis tool called CleanC [112]. The tool identifies several code patterns that hide parallelism in C code and leaves it up to the programmer to fix them. In this way, the refactored code is more amenable to parallelism extraction by other programmers or by automatic parallelization tools. Chandraiah et al. [49] went a step further by actually including code transformations for parallel execution on heterogeneous MPSoCs. They do this in an interactive framework where transformations are applied under the designer’s full control. A similar approach is followed by the authors of the MPSoC Parallelization Assist (MPA) [13, 178]. In MPA, the user provides the application code and a high-level […] ulator, the user can iterate quickly and modify the parallel specification. MPA inserts communication and synchronization primitives automatically, except for shared variables, which have to be explicitly identified by the programmer. Instead of having a separate parallel specification, Reid et al. [211] devised extensions to the C language to embed this information in the C source code itself.

Instead of circumventing the shortcomings of parallelizing compilers with expert application knowledge, some authors opted for restricting the input language, e.g., by prohibiting C break statements. They can then safely apply parallelization schemes automatically without caring about the intricacies of the language. With this approach it is possible to derive parallel specifications that constrain themselves to a given MoC, like KPN. This is the case for the automatic derivation of PNs from MATLAB code (Harris et al. [101]) and from C code (Verdoolaege et al. [264]). These works support only a subset of their respective input languages, namely so-called (Static) Affine Nested Loop Programs (SANLPs), i.e., programs that consist of a single loop nest whose bounds are affine expressions of the enclosing iterators. Several relaxations for dynamic loop bounds and while loops have been presented in [185, 186, 228]. Recently, Geuns et al. [89] published an approach that supports NLPs with unknown upper loop bounds.
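As a minimal illustration of this restriction, the sketch below contrasts a loop nest in SANLP style with a loop that falls outside the class. All function names (`produce`, `transform`, `sanlp_example`, `non_sanlp_example`) are hypothetical and not taken from the cited tools. In the first function, every loop bound and array index is an affine expression of the enclosing iterators and of the constant `N`, so the iteration domain is known at compile time. In the second, a data-dependent `break` makes the iteration domain unknown.

```c
#include <assert.h>
#include <stddef.h>

#define N 4

/* Hypothetical function calls; in a derived process network, each call
 * inside the loop nest would typically become one process. */
static int produce(int i, int j) { return i * N + j; }
static int transform(int x)      { return 2 * x + 1; }

/* SANLP style: bounds (0, N, j = i) and indices (tmp[i][j]) are affine
 * in the iterators, and there is no data-dependent control flow. */
int sanlp_example(void)
{
    int tmp[N][N];
    int sum = 0;
    for (int i = 0; i < N; ++i) {
        for (int j = i; j < N; ++j) {
            tmp[i][j] = produce(i, j);
            sum += transform(tmp[i][j]);
        }
    }
    return sum;
}

/* Not a SANLP: the loop is left on a data-dependent condition, so the
 * number of iterations cannot be derived from the code alone. */
int non_sanlp_example(const int *data, int n)
{
    assert(data != NULL || n == 0);
    int count = 0;
    for (int i = 0; i < n; ++i) {
        if (data[i] < 0)
            break;
        ++count;
    }
    return count;
}
```

The second function is the kind of code that the relaxations for dynamic loop bounds and while loops mentioned above aim to bring back into scope.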

The last three examples show that by restricting the input language it is possible to automatically derive parallel implementations. These implementations are generally formally founded and therefore display properties that are well received in the embedded domain, e.g., determinism or deadlock-free execution. These solutions, however, do not directly address the problem of generic legacy code. Besides, there is an effort involved in rewriting an application to be compliant with the restricted sequential specification. This includes refactoring the code into an NLP whose function calls have a granularity coarse enough for the parallelization to be worthwhile. It is arguable that this effort is comparable to that of rewriting the application using a simple abstract parallel programming model.

Two final works, by Weng et al. and Cordes et al., are worth highlighting, since they tackle automatic parallelism extraction from almost arbitrary C code.

Weng et al. [268] presented a partitioning algorithm for applications in the network processing domain. They employ an approximation of the ratio cut algorithm [267], adapted to produce balanced partitions. In this way they obtain a relatively balanced functional pipeline. The pipeline stages are thereafter mapped to the target platform by using random search. Weng’s approach is similar to the MAPS partitioning approach in [45]. It is, however, restricted to the regular workloads and processor arrays that are common in network processing. Besides, its assembly-level profiling would make it difficult to integrate into an interactive framework.
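For reference, the objective behind ratio-cut partitioning can be sketched as follows. The sketch uses the standard form of the metric, the weight of the cut edges divided by the product of the two partition sizes. It only evaluates the metric for a given bipartition of a small hypothetical graph and is not the approximation algorithm of Weng et al.; the function name `ratio_cut` and the fixed graph size are assumptions made for illustration.

```c
#include <assert.h>

#define NODES 4

/* Ratio-cut value of a bipartition: cut(A, B) / (|A| * |B|).
 * w is a symmetric edge-weight matrix (0 means no edge);
 * side[v] is 0 if node v belongs to A and 1 if it belongs to B. */
double ratio_cut(int w[NODES][NODES], const int side[NODES])
{
    int cut = 0, size_a = 0, size_b = 0;
    for (int u = 0; u < NODES; ++u) {
        if (side[u] == 0)
            ++size_a;
        else
            ++size_b;
        /* visit each undirected edge once (v > u) */
        for (int v = u + 1; v < NODES; ++v)
            if (side[u] != side[v])
                cut += w[u][v];
    }
    assert(size_a > 0 && size_b > 0);   /* both sides must be non-empty */
    return (double)cut / (double)(size_a * size_b);
}
```

A lower value is better: the denominator rewards partitions of similar size, while the numerator penalizes communication across the cut. For a chain of four nodes with edge weights 5, 1 and 5, cutting the light middle edge gives 1/4, whereas isolating the first node gives 5/3.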

The work by Cordes et al. [56] complements the MPA tool by adding automatic parallelism extraction. They do so by using a hierarchical task graph, similar to [92], and applying Integer Linear Programming (ILP) to produce a partition. Their ILP formulation takes into account constraints which are common in embedded systems. The authors later applied genetic algorithms for multi-objective optimization [55].

3.2.3 Commercial Solutions

Some of the techniques in the previous sections have made their way into commercial products. The most prominent examples are VectorFabrics [263], Compaan [54] and CriticalBlue [58]. The first provides sequential code partitioning tools that are similar to the ones presented in this thesis. VectorFabrics’ Pareon tool explicitly searches for patterns of parallel execution in the application. It suggests several patterns to the user, each with an estimated speedup, and lets the user decide which to use. VectorFabrics does not provide code generation facilities but instead exports so-called recipes, i.e., detailed steps that the user can follow to parallelize the application. Support for heterogeneous embedded systems is under development in VectorFabrics. Compaan, in turn, has its origin in the previously cited work of Harris [101]. Its Hotspot Parallelizer can automatically translate a C program into a parallel version. As in [101], Compaan can only analyze a subset of the C language, and produces a process network version of the application. CriticalBlue’s Prism tool provides code analysis based on simulation traces as well. Internally, Prism emulates the parallel execution of an application given a parallelization strategy. This makes it possible to explore different strategies and perform what-if analysis. Prism supports a wider range of platforms than VectorFabrics but provides less help for actually generating code, which is entirely left to the programmer.

Table 3.2: Comparison of parallelism extraction approaches.

| Approach | Domain | Method | Parallelism | Input | Heterogeneity |
|---|---|---|---|---|---|
| Ottoni [194] | HPC | Clustering | PLP | C | ✗ |
| Bridges [30] | HPC | Clustering | PLP | C+ext | ✗ |
| Tournavitis [253] | HPC | Machine learning | PLP | C | ✗ |
| Rul [214] | HPC | Pattern | PLP | C | ✗ |
| Karkowski [128] | Embedded | Clustering | PLP | C | ✗ |
| Verdoolaege [264] | Embedded | – | PN | C (SANLP) | – |
| Geuns [89] | Embedded | – | PLP | C (NLP) | – |
| Weng [268] | Embedded | Clustering | PLP | C | ✗ |
| Cordes [55, 56] | Embedded | ILP, GA | TLP | C | ✗ |
| VectorFabrics [263] | HPC | Pattern | D,T,PLP | C | ✓ |
| Compaan [54] | Embedded | – | PN | C (SANLP) | ✗ |
| MAPS | Embedded | Clustering, Pattern | D,T,PLP | C | ✓ |

3.2.4 MAPS in Perspective

Sections 3.2.1–3.2.3 presented works targeting manual, semi-automatic and automatic parallelism extraction. To better place the contributions of this thesis, approaches for automatic parallelism extraction are listed with their main features in Table 3.2. The method (third column in the table) roughly describes how partitions are obtained. The entry "Pattern" refers to algorithms that explicitly search for parallelism patterns in the application IR. As can be seen from the table, most approaches target parallelism in loops (PLP) and very few are explicitly meant for heterogeneous platforms. If a feature does not apply to an approach, the cell entry is marked empty ('–'). For example, in the approaches by Geuns [89] and Verdoolaege [264], there is no partitioning method, since the tasks are determined by the code lines within the nested loop.

The last row in Table 3.2 represents the MAPS parallelization tools with the extensions presented in this thesis. The initial MAPS algorithms were based on graph clustering to exploit TLP [45]. The partitioning algorithm, introduced in Chapter 5, extends the MAPS

3.3. Synthesis of Parallel Specifications 49

In document Programming heterogeneous MPSoCs : tool flows to close the software productivity gap (Page 55-59)