The MetaFork language - Towards Comprehensive Parametric Code Generation Targeting Graphics Processing Units in Support of Scientific Computation

MetaFork [29] is a high-level programming language extending C/C++, which combines several models of concurrency, including fork-join and pipelining parallelisms. MetaFork is also a compilation framework, which aims at facilitating the design and implementation of concurrent programs through three key features:

1. Perform automatic code translation between concurrency platforms targeting both multi-core and many-core GPU architectures.

2. Provide a high-level language for expressing concurrency as in the fork-join model, the SIMD (single instruction multiple data) paradigm and the pipelining parallelism.

3. Generate parallel code from serial code with an emphasis on code depending on machine or program parameters (e.g. cache size, number of processors, number of threads per thread-block).

As of today, the latest publicly available release of MetaFork, see www.metafork.org, offers the second feature stated above, a preliminary implementation of the third feature as well as the multi-core and many-core portions of the first one. To be more specific, MetaFork is a meta-language for concurrency platforms based on the fork-join model, pipelining parallelism and the SIMD paradigm. This meta-language forms a bridge between actual multithreaded programming languages, and we use it to perform automatic code translation between those languages.

In an earlier work [29], MetaFork was introduced as an extension of both the C and C++ languages into a multithreaded language based on the fork-join concurrency model [11]. Thus, concurrent execution is obtained by a parent thread creating and launching one or more child threads, so that the parent and its children execute a so-called parallel region. An important example of parallel regions are for-loop bodies. MetaFork has four parallel constructs dedicated to the fork-join model: function call spawn, block spawn, parallel for-loop and synchronization barrier. The first two use the keyword meta_fork, while the other two use, respectively, the keywords meta_for and meta_join. Similar to the CilkPlus specifications, the parallel constructs of MetaFork grant permission for concurrent execution but do not command it. Hence, a MetaFork program can execute on a single-core machine. We emphasize the fact that meta_fork allows the programmer to spawn a function call (as in CilkPlus [10, 79, 35]) as well as a block (as in OpenMP [41, 13, 9]). Using the same examples as in [29], Figures 2.5, 2.6 and 2.7 illustrate automatic code translation between OpenMP programs and CilkPlus programs via the MetaFork language.
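The remark that the constructs permit, but do not command, concurrent execution can be illustrated with a small serial sketch. This is only an illustration under stated assumptions: the two macros below are not part of the MetaFork distribution, they simply erase the keywords so that a plain C compiler sees a serial program. The cutoff value BASE and the body of fib_serial are hypothetical choices made for the example.

```c
/* Illustrative macros only (an assumption, NOT provided by MetaFork):
   erasing the keywords leaves a valid serial C program. */
#define meta_fork
#define meta_join

#define BASE 10 /* hypothetical cutoff below which the serial version is used */

/* Hypothetical serial fallback: iterative Fibonacci, returns F(n). */
static long fib_serial(long n) {
    long a = 0, b = 1;
    for (long i = 0; i < n; i++) {
        long t = a + b;
        a = b;
        b = t;
    }
    return a;
}

long fib(long n) {
    long x, y;
    if (n < 2) {
        return n;
    } else if (n < BASE) {
        return fib_serial(n);
    } else {
        x = meta_fork fib(n - 1); /* expands to: x = fib(n - 1); */
        y = fib(n - 2);
        meta_join;                /* expands to an empty statement */
        return x + y;
    }
}
```

With these macros, fib computes the same values as its parallel counterpart would, only on a single thread; a real MetaFork translation would instead map the keywords to the constructs of the target platform.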

On the other hand, stencil computations are a major pattern in scientific computing. Stencil codes perform a sequence of sweeps (called time-steps) through a given array, and each sweep can be seen as the execution of a pipeline. When expressed with concurrency platforms based on, and limited by, the fork-join model, parallel stencil computations incur excessive

    long fib(long n) {
        long x, y;
        if (n < 2) {
            return n;
        } else if (n < BASE)
            return fib_serial(n);
        else {
            x = cilk_spawn fib(n-1);
            y = fib(n-2);
            cilk_sync;
            return (x+y);
        }
    }

(a) A given CilkPlus program

    long fib(long n) {
        long x, y;
        if (n < 2) {
            return n;
        } else if (n < BASE)
            return fib_serial(n);
        else {
            x = meta_fork fib(n-1);
            y = fib(n-2);
            meta_join;
            return (x+y);
        }
    }

(b) The intermediate MetaFork program

    long fib(long n) {
        long x, y;
        if (n < 2) {
            return n;
        } else if (n < BASE)
            return fib_serial(n);
        else {
            #pragma omp task shared(x)
            x = fib(n-1);
            y = fib(n-2);
            #pragma omp taskwait
            return (x+y);
        }
    }

(c) The translated OpenMP program

Figure 2.5: Using MetaFork to translate a given CilkPlus program into an OpenMP program

    int main() {
        int a[N];
        int b = 0;
        #pragma omp parallel
        #pragma omp for private(b)
        for (int i = 0; i < N; i++) {
            b = i;
            a[i] = b;
        }
    }

(a) A given OpenMP program

    int main() {
        int a[N];
        int b = 0;
        meta_for (int i = 0; i < N; i++) {
            int b;
            b = i;
            a[i] = b;
        }
    }

(b) The intermediate MetaFork program

    int main() {
        int a[N];
        int b = 0;
        cilk_for (int i = 0; i < N; i++) {
            int b;
            b = i;
            a[i] = b;
        }
    }

(c) The translated CilkPlus program

Figure 2.6: Using MetaFork to translate a given OpenMP program into a CilkPlus program

parallelism overheads. Shirako, Unnikrishnan, Chatterjee, Li and Sarkar [107] study this problem and propose a solution in the context of OpenMP, namely new synchronization constructs that enable do-across parallelism. These observations have motivated a first extension of the MetaFork language with three constructs to express pipelining parallelism: meta_pipe, meta_wait and meta_continue. Recall that a pipeline is a linear sequence of processing stages through which data items flow from the first stage to the last stage. If each stage can process only one data item at a time, then the pipeline is said to be serial and can be depicted by a (directed) path in the sense of graph theory. If a stage can process more than one data item at a time, then the pipeline is said to be parallel and can be depicted by a directed acyclic graph (DAG), where each parallel stage is represented by an independent set, that is, a set of vertices of which no pair is adjacent.
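To make the stencil pattern discussed above concrete, the following is a minimal sketch of a 1D three-point averaging stencil written in the fork-join style. It is a generic OpenMP-style formulation chosen for illustration, not code taken from [107] or generated by MetaFork; the function name jacobi_1d and the scratch buffer tmp are assumptions of the example. Each parallel loop ends with an implicit barrier, so every time-step pays two fork-join synchronizations, which is the kind of overhead that pipelining constructs aim to reduce.

```c
#include <stddef.h>

/* Hypothetical helper: performs `steps` sweeps of a 3-point averaging
   stencil over a[0..n-1]. The boundary cells a[0] and a[n-1] are kept
   fixed; tmp is caller-provided scratch space of length n. */
void jacobi_1d(double *a, double *tmp, size_t n, int steps) {
    for (int t = 0; t < steps; t++) { /* time-steps run in order */
        /* Sweep: each interior cell reads only values of the previous step. */
        #pragma omp parallel for
        for (long i = 1; i < (long)n - 1; i++)
            tmp[i] = (a[i - 1] + a[i] + a[i + 1]) / 3.0;
        /* Implicit barrier above; then copy the new values back. */
        #pragma omp parallel for
        for (long i = 1; i < (long)n - 1; i++)
            a[i] = tmp[i];
    }
}
```

When compiled without OpenMP support, the pragmas are ignored and the function runs serially with the same result.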

In order to generate efficient CUDA code from an input MetaFork program, we introduced a tenth keyword, namely meta_schedule, in [19]. This keyword allows its body to be scheduled on a device, such as an NVIDIA GPU, and executed in a SIMD fashion. In Chapter 5 as well as Appendix C, we describe the MetaFork-to-CUDA code generator, which is capable of


    int main() {
        int sum_a = 0, sum_b = 0;
        int a[5] = {0,1,2,3,4};
        int b[5] = {0,1,2,3,4};
        #pragma omp parallel
        {
            #pragma omp sections
            {
                #pragma omp section
                {
                    for (int i = 0; i < 5; i++)
                        sum_a += a[i];
                }
                #pragma omp section
                {
                    for (int i = 0; i < 5; i++)
                        sum_b += b[i];
                }
            }
        }
    }

(a) A given OpenMP program

    int main() {
        int sum_a = 0, sum_b = 0;
        int a[5] = {0,1,2,3,4};
        int b[5] = {0,1,2,3,4};
        meta_fork shared(sum_a) {
            for (int i = 0; i < 5; i++)
                sum_a += a[i];
        }
        meta_fork shared(sum_b) {
            for (int i = 0; i < 5; i++)
                sum_b += b[i];
        }
        meta_join;
    }

(b) The intermediate MetaFork program

    void fork_func0(int* sum_a, int* a) {
        for (int i = 0; i < 5; i++)
            (*sum_a) += a[i];
    }

    void fork_func1(int* sum_b, int* b) {
        for (int i = 0; i < 5; i++)
            (*sum_b) += b[i];
    }

    int main() {
        int sum_a = 0, sum_b = 0;
        int a[5] = {0,1,2,3,4};
        int b[5] = {0,1,2,3,4};
        cilk_spawn fork_func0(&sum_a, a);
        cilk_spawn fork_func1(&sum_b, b);
        cilk_sync;
    }

(c) The translated CilkPlus program

Figure 2.7: Using MetaFork to translate a given OpenMP program into a CilkPlus program

automatically generating compilable CUDA programs with program parameters (such as the number of threads per thread-block) and machine parameters (such as the shared memory size) allowed at code-generation time.

This work is summarized in [19].
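The role of a program parameter such as the number of threads per thread-block can be illustrated with a plain C emulation of a blocked loop. This is a sketch under stated assumptions and not output of the MetaFork-to-CUDA code generator: the function name vec_add_blocked is hypothetical, the two nested loops merely stand in for the thread-blocks and the threads of a GPU kernel, and block_size is assumed to be strictly positive. The point is that the code is written once and remains correct for any value of the parameter.

```c
#include <stddef.h>

/* Plain-C emulation (an assumption for illustration, not generated code):
   block_size plays the role of the number of threads per thread-block
   and is only fixed when the function is called. Requires block_size > 0. */
void vec_add_blocked(const int *x, const int *y, int *z,
                     size_t n, size_t block_size) {
    /* Number of blocks needed to cover n elements: ceil(n / block_size). */
    size_t num_blocks = (n + block_size - 1) / block_size;
    for (size_t b = 0; b < num_blocks; b++) {     /* one iteration per block */
        for (size_t t = 0; t < block_size; t++) { /* one iteration per thread */
            size_t i = b * block_size + t;        /* global element index */
            if (i < n)                            /* guard the last, partial block */
                z[i] = x[i] + y[i];
        }
    }
}
```

Calling the function with different values of block_size, for instance 2 and 4 on a vector of length 5, yields the same output vector; only the decomposition of the work changes.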
