Software architectures for flexible task-oriented program execution on multicore systems

THOMAS RAUBER
University Bayreuth, rauber@uni-bayreuth.de

GUDULA RÜNGER
Chemnitz University of Technology, [email protected]–chemnitz.de

Abstract

The article addresses the challenges of software development for current and future parallel hardware architectures which will be dominated by multicore and manycore architectures in the near future. This will have the following effects: In several years desktop computers will provide many computing resources with more than 100 cores per processor. Using these multicore processors for cluster systems will create systems with thousands of cores and a deep memory hierarchy. A new generation of programming methodologies is needed for all software products to efficiently exploit the tremendous parallelism of these hardware platforms.

The article aims at the development of a parallel programming methodology exploiting a two-level task-based view of application software for the effective use of large multicore or cluster systems. Task-based programming models with static or dynamic task creation are discussed and suitable software architectures for designing such systems are presented.

1 Introduction

Future cluster and multicore systems will soon offer ubiquitous parallelism not only for high-performance computing but for all software developers and application areas. However, the mainstream programming model is still sequential and von Neumann oriented, and sophisticated programming techniques are required to access the parallel resources available. Therefore, a change in programming and software development is imperative for making the capabilities of the new architectures available to programmers and users of all kinds of software systems.

This article addresses software development for future parallel systems with a large number of compute units (cores). Although many parallel programming models, including MPI, OpenMP, and Pthreads, have been developed in the past (especially targeting HPC and scientific computing), none of them seems to be appropriate for mainstream software development for large-scale complex industrial systems, since the level of abstraction they provide is too low [22]. The trend in hardware technology towards large multicore systems requires a higher level of abstraction for software development to achieve productivity and scalability. Providing such programming abstractions for parallel execution is a key challenge of future software development.

In this article, we propose a parallel programming methodology exploiting task-based views of software products, so that future multicore and cluster systems can be used efficiently and productively without requiring too much effort from the programmer. The main goal is to deliver a hybrid and flexible parallel programming model at a high level of abstraction (in contrast to low-level state-of-the-art models like MPI or Pthreads) together with a corresponding programming environment. A main feature of the approach is the decoupling of the specification of a parallel algorithm from the actual execution of the parallel work units identified by the specification on a given parallel system. This allows programmers to concentrate on the characteristics of their algorithm and relieves them from low-level details of the parallel execution. In particular, we propose to extend the standard model of task-based execution to multi-threaded tasks that can be executed by multiple cores and in parallel with other multi-threaded tasks of the same application program.

The use of an appropriate specification mechanism allows the expression of algorithms from different application areas on an abstract level and relieves the programmer from an explicit mapping of computations to processors as well as from the specification of explicit data exchanges. This facilitates the development of parallel programs significantly and, thus, makes parallel computing resources available to a large community of software developers. A coordination language provides support for the specification of tasks and for expressing dependencies between tasks.

In the following sections, we present an overview of a new approach for task-oriented programming in Section 2, a description of the corresponding software architectures for task execution in Section 3, and a runtime evaluation in Section 4. Section 5 discusses related work and Section 6 concludes.

2 Programming models with tasks

In this section, we give a short overview of task-based programming, which we propose to use for multicore systems with a large number of cores.

Task-based programming is based on a decomposition of the application program into tasks that can cooperate with each other. Task creation can be fixed statically at compile time or at program start, but can also evolve dynamically during program execution. An important property is that, for an arbitrary execution platform, task creation is separate from task execution, i.e., tasks do not need to be executed immediately after their creation, and the specification of tasks does not necessarily fix details of their execution. Moreover, the task definitions may vary significantly depending on the special needs of the execution. In the following, task-oriented programming models are classified according to different criteria, including static or dynamic task creation, sequential or parallel execution of a single task, or the specification of dependencies between tasks.

2.1 Task decomposition

An application program is decomposed into tasks such that the application can be represented by a tuple $(\mathcal{T}, \mathcal{C})$ with a set of tasks $\mathcal{T} = \{T_1, T_2, \ldots\}$ and a coordination structure $\mathcal{C}$ defining the interactions between tasks. A task incarnates a specific computation in a piece of program code. Usually, a task captures a logical unit of the application and can be defined with different granularities. The coordination structure describes possible cooperations between the tasks of a specific application program. For the execution mode of tasks, we can distinguish between:

• sequential execution of a task on a single core;

• parallel execution of a task on several cores with a shared address space using a multi-threaded execution with synchronization;

• parallel execution of a task on several cores or execution units employing a distributed address space performing intra-task communication by message-passing;

• parallel execution of a task on several cores with a mixed shared and distributed address space, containing both synchronization and communication.

Moreover, we can distinguish between task executions for which the number of cores is fixed when the task is deployed and task executions for which the number of cores can be changed during execution. The latter case is particularly interesting for long-running tasks for which an adaptation to the specific execution situation of the target platform is beneficial. For parallel platforms with multicore processors, a model with a parallel execution of a single task using a shared address space is particularly interesting, since it allows the mapping of a single task to all cores of a multicore processor, thus exploiting the memory hierarchy efficiently. In this case, the individual tasks are implemented in a multi-threaded way using shared variables for information exchange.

The cooperation between the tasks can also be specified in different ways, leading to the following main characteristics for task cooperations:

• tasks cooperate by specifying input-output relations, i.e., one task outputs data that is used by another task as input; in this case, the tasks must be executed one after another;

• tasks are independent of each other, allowing a concurrent execution without interactions;

• tasks can cooperate during their execution by exchanging information or data, thus requiring a concurrent execution of cooperating tasks.

The different execution modes combined with the different cooperation modes result in several different task-based programming models, which require different implementation support for a correct and efficient execution.

An additional criterion is the time when the tuple $(\mathcal{T}, \mathcal{C})$ is defined. This can be done statically such that the application program defines the tasks and their interactions in the source program. In this case, all tasks and their interactions are known at compile time before the actual program execution, and appropriate scheduling and mapping techniques can be used to yield the best task execution on a given platform. In contrast, the tasks may also be created dynamically at runtime. In this case, the tasks and their interactions are not known at program start and, thus, task deployment must be planned at runtime.

In the following, we consider a task-oriented programming model with parallel tasks using a shared address space and a coordination structure based on input-output dependencies, i.e., there are no interactions between concurrently running tasks during their execution. Application programs coded according to this model are suitable for single multicore processors as well as for clusters of multicore processors. Both the static and the dynamic case are considered.

2.2 Task execution and interaction

The set of tasks $\mathcal{T}$ consists of parallel tasks that can be executed on multiple cores of a multicore processor or on the nodes of a multicore cluster consisting of several multicore processors. For each task, a parameterized number of $p$ cores can be used. Each task is implemented as a multi-threaded program using shared variables for information exchange. For each task, we distinguish between internal variables and external variables. The internal variables are only visible within the task and can be accessed by all threads executing that task. The threads executing a task have to use a synchronization mechanism for a coordinated access to the internal variables. Such synchronization mechanisms, e.g., lock variables, are available in current languages or libraries for thread-based programming with a shared address space, such as Pthreads, OpenMP, or Java threads. The external variables of a task $T$ are visible to other tasks $T' \in \mathcal{T}$ of the source program. The visibility is restricted to an input-output relation between tasks, which means that an explicit variable $A$ produced by task $T$ can be used as input by another task $T'$. In such a case, the execution of $T$ has to be finished before the execution of $T'$ can start. Tasks $T$ and $T'$ can be executed on the same set of cores, but different sets of cores can also be used. In the latter case, synchronization or communication has to be used between the executions of $T$ and $T'$ to make the output variables of $T$ available to the cores executing $T'$. Thus, there has to be a coordinated access to a variable $A$ at task level. Considering internal and external variables, a synchronization structure at two separate levels results. The advantage of this approach is that the synchronization of variables is restricted to smaller parts of the program and that only specific interaction via shared variables is allowed: interactions within a task via internal variables and interactions between tasks through input-output relations via external variables. Another advantage of this two-level structure is an increase of parallelism due to the parallel execution of a single task as well as the concurrent execution of independent tasks.
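The two synchronization levels can be illustrated with a minimal Python sketch. It is only an illustration under stated assumptions, not the execution environment of this article: the helper names `run_parallel_task`, `make_sum_task`, and `make_scale_task` are hypothetical, a dictionary stands in for the external variables, and Python threads are used purely to show the structure.

```python
import threading

def run_parallel_task(body, p):
    """Run one multi-threaded task: p threads execute body(thread_id, p) and are joined."""
    threads = [threading.Thread(target=body, args=(i, p)) for i in range(p)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()  # the task is finished only when all of its threads are done

def make_sum_task(data, external):
    """Task T: sums `data` with p threads and publishes the external variable 'A'."""
    partial = {"sum": 0}      # internal variable, visible only inside T
    lock = threading.Lock()   # level 1: synchronization between the threads of T

    def body(tid, p):
        local = sum(data[tid::p])  # each thread handles a strided slice
        with lock:
            partial["sum"] += local

    def task(p):
        run_parallel_task(body, p)
        external["A"] = partial["sum"]  # output variable, written once at task end
    return task

def make_scale_task(external, factor):
    """Task T': reads the input variable 'A' and writes the output variable 'B'."""
    def task(p):
        external["B"] = external["A"] * factor
    return task

external = {}
T = make_sum_task(list(range(100)), external)
T_prime = make_scale_task(external, 2)
T(p=4)        # level 2: T must finish before T' starts (input-output relation)
T_prime(p=1)
print(external["B"])
```

The lock protects only the internal variable of `T`, while the ordering of the two calls plays the role of the task-level coordination for the external variable `A`.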

2.3 Internal and external variables

For a task $T \in \mathcal{T}$, let $IV_T$ denote the set of internal variables and $EV_T$ the set of external variables. Each external variable is either an input variable provided for $T$ before the actual execution starts, or an output variable that is computed by $T$ and is available after the execution of $T$ has been finished. For the internal and external variables of the tasks $T \in \mathcal{T}$ of one specific task-based parallel program, the following requirements have to be fulfilled for the two-level shared-memory task structure:

• For any two tasks $T, T' \in \mathcal{T}$, the sets of internal variables have to be disjoint, i.e., $IV_T \cap IV_{T'} = \emptyset$. This can be achieved by a locality of data variables and a visibility that is restricted to the task itself.

[Figure 1 (diagram not reproduced). Recoverable labels: axes "cores" and "time"; tasks $T_1$, $T_2$, $T_3$, each annotated "visible: EV, IV"; "barrier and flush"; "access to $SV_{T_1 T_3}$", "access to $SV_{T_2 T_3}$", "access to $SV_{T_1 T_3}$ and $SV_{T_2 T_3}$".]

Figure 1: Execution situation for three tasks $T_1$, $T_2$, $T_3$ with $EV_{T_1} \cap EV_{T_2} = \emptyset$, $EV_{T_1} \cap EV_{T_3} = SV_{T_1 T_3}$ and $EV_{T_2} \cap EV_{T_3} = SV_{T_2 T_3}$.

• For any two tasks $T, T' \in \mathcal{T}$, the sets of external variables can either be disjoint, i.e., $EV_T \cap EV_{T'} = \emptyset$, or can be accessed in a strictly predefined order as specified by the task program. For example, if a variable $v \in EV_T \cap EV_{T'} \neq \emptyset$ exists and $T$ uses $v$ as output variable and $T'$ uses $v$ as input variable, then $T$ has to be executed before $T'$. In particular, a temporally interleaved access to $v \in EV_T \cap EV_{T'}$ by both $T$ and $T'$ is not allowed.
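The two requirements can be checked mechanically once the variable sets of each task are known. The following Python sketch assumes a hypothetical `Task` record holding the internal, input, and output variables as sets of names; it is an illustration, not the correctness checker described later in this article.

```python
from dataclasses import dataclass, field

@dataclass
class Task:
    name: str
    internal: set = field(default_factory=set)  # IV_T
    inputs: set = field(default_factory=set)    # input part of EV_T
    outputs: set = field(default_factory=set)   # output part of EV_T

def internal_sets_disjoint(tasks):
    """First requirement: no internal variable belongs to two different tasks."""
    seen = set()
    for t in tasks:
        if t.internal & seen:
            return False
        seen |= t.internal
    return True

def ordering_constraints(tasks):
    """Second requirement: pairs (T, T') where T outputs a variable that T' reads."""
    return {(t.name, u.name)
            for t in tasks for u in tasks
            if t is not u and t.outputs & u.inputs}

# Situation similar to Figure 1: T1 and T2 share nothing, T3 reads from both.
t1 = Task("T1", internal={"tmp1"}, outputs={"a"})
t2 = Task("T2", internal={"tmp2"}, outputs={"b"})
t3 = Task("T3", internal={"tmp3"}, inputs={"a", "b"}, outputs={"c"})
print(internal_sets_disjoint([t1, t2, t3]))
print(sorted(ordering_constraints([t1, t2, t3])))
```

Each returned pair corresponds to one "has to be executed before" constraint between two tasks.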

To guarantee these constraints for external variables, the tasks of a program may have to be executed according to a predefined execution order as specified by the input-output relations between the tasks. When task $T$ has to be executed before task $T'$, this is denoted by $T \to T'$. Tasks that are not connected by an input-output relation can be executed in any order, and they can also be executed concurrently with each other, see Fig. 1 for an illustration. The execution order resulting from the input-output relations is a relation between tasks. The entire set of relations between the tasks of a task-based program can be represented as a graph structure $G = (V, E)$ where $V = \mathcal{T}$ and $E$ captures the input-output relations between the tasks. This graph structure is also called the task graph. A task program is valid if its task graph is a directed acyclic graph (DAG). Efficient execution orders of task graphs are considered in Section 3.
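The validity condition on the task graph can be sketched with the `graphlib` module of the Python standard library (Python 3.9 or later). The function names and the representation of $E$ as a set of (before, after) pairs are assumptions made for this illustration.

```python
from graphlib import TopologicalSorter, CycleError

def execution_levels(tasks, edges):
    """Group tasks into levels; tasks within one level do not depend on each other."""
    preds = {t: set() for t in tasks}
    for before, after in edges:
        preds[after].add(before)
    sorter = TopologicalSorter(preds)
    sorter.prepare()  # raises CycleError if the graph is not a DAG
    levels = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        levels.append(ready)
        sorter.done(*ready)
    return levels

def is_valid_task_program(tasks, edges):
    """A task program is valid if its task graph contains no cycle."""
    try:
        execution_levels(tasks, edges)
        return True
    except CycleError:
        return False

levels = execution_levels(["T1", "T2", "T3"], {("T1", "T3"), ("T2", "T3")})
print(levels)
```

All tasks of one level may run concurrently, which is the degree of freedom a scheduler can exploit.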

2.4 Coordination language

For the interactions between tasks, a coordination language is provided. The coordination language provides operators for specifying dependencies or independencies between tasks. For two tasks, the operator || is used to express independence; for more than two independent tasks, the operator parfor is provided. The operator ◦ expresses a dependence between two tasks.
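One way to picture these operators is as combinators that build an expression tree from which the precedence edges of the task graph can be derived. The Python sketch below is an assumption-laden illustration, not the syntax or implementation of the coordination language: `|` stands in for ||, `>>` stands in for ◦ (with the left operand executed first), and the classes `Task`, `Par`, `Seq` and the helper `parfor` are hypothetical names.

```python
from dataclasses import dataclass

class Expr:
    def __or__(self, other):      # models the operator ||
        return Par((self, other))
    def __rshift__(self, other):  # models the operator ◦ (left operand runs first)
        return Seq((self, other))

@dataclass(frozen=True)
class Task(Expr):
    name: str

@dataclass(frozen=True)
class Par(Expr):
    parts: tuple

@dataclass(frozen=True)
class Seq(Expr):
    parts: tuple

def parfor(indices, make_task):
    """Independent activations, one task per index."""
    return Par(tuple(make_task(i) for i in indices))

def sources(e):
    """Tasks of e that can start first."""
    if isinstance(e, Task):
        return {e.name}
    if isinstance(e, Par):
        return set().union(*(sources(p) for p in e.parts))
    return sources(e.parts[0])

def sinks(e):
    """Tasks of e that finish last."""
    if isinstance(e, Task):
        return {e.name}
    if isinstance(e, Par):
        return set().union(*(sinks(p) for p in e.parts))
    return sinks(e.parts[-1])

def edges(e):
    """Precedence edges (before, after) implied by the expression."""
    if isinstance(e, Task):
        return set()
    result = set().union(*(edges(p) for p in e.parts))
    if isinstance(e, Seq):
        for a, b in zip(e.parts, e.parts[1:]):
            result |= {(x, y) for x in sinks(a) for y in sources(b)}
    return result

spec = (Task("A") | Task("B")) >> Task("C")
print(sorted(edges(spec)))
```

Note that in Python `>>` binds more tightly than `|`, so parallel groups need parentheses in this sketch.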

In the declaration of a task, the external variables are declared in the form of input and output variables along with their corresponding type. We illustrate the use of the coordination language for two methods from numerical analysis.

Figure 2 shows a coordination program for the iterated Runge-Kutta (RK) method for solving systems of ordinary differential equations (ODEs). The method performs a series of time steps. In each time step, an $s$-stage iterated RK method performs a fixed number $m$ of iterations to compute $s$ stage vectors $v_1, \ldots, v_s$ iteratively and uses the result of the last iteration to compute the next approximation vector ynew. In the figure, the stage vectors are computed in ItComputeStagevectors, and the next approximation vector is computed in ComputeApprox. The new step-size hnew and the x-value xnew for the next iteration step are computed in StepsizeControl. The ODE system to be solved is represented by the parameter f. The computation of the stage vectors is performed by a sequential loop with $m$ iterations where each iteration is a parallel parfor loop with $s$ independent activations of the task StageVector, which is not shown in detail. The parameter list contains IN and OUT variables, which form the set of external variables for that function.

External task declarations:

    StageVector(IN f: scal×vec(n)→vec(n), x: scal, y: vec(n), s: scal, A: mat(s×s), h: scal, V: list[s] of vec(n);
                OUT vnew: vec(n))
    ComputeApprox(IN f: scal×vec(n)→vec(n), x: scal, y: vec(n), s: scal, b: vec(s), h: scal, V: list[s] of vec(n);
                  OUT ynew: vec(n))
    StepsizeControl(IN y: vec(n), ynew: vec(n); OUT hnew: scal, xnew: scal)

Task definitions:

    ItRKmethod(IN f: scal×vec(n)→vec(n), x: scal, xend: scal, y: vec(n), s: scal, A: mat(s×s), b: vec(s), h: scal;
               OUT X: list[] of scal, Y: list[] of vec(n)) =
      while (x < xend) {
        ItComputeStagevectors(f, x, y, s, A, h ; V)
        ◦ ComputeApprox(f, x, y, s, b, h, V ; ynew)
        ◦ StepsizeControl(y, ynew ; hnew, xnew)
      }

    ItComputeStagevectors(IN f: scal×vec(n)→vec(n), x: scal, y: vec(n), s: scal, A: mat(s×s), h: scal;
                          OUT V: list[s] of vec(n)) =
      InitializeStage(y ; V)
      ◦ for (j=1,...,m)
          parfor (l=1,...,s) StageVector(f, x, y, V ; Vnew)

Figure 2: Specification program of the iterated RK method.
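The control structure of one time step in Figure 2 can be mirrored in a short Python sketch. Since StageVector is not shown in detail, the formula used here is an assumption: the usual fixed-point iteration $v_l \leftarrow f(x + c_l h,\; y + h \sum_i a_{li} v_i)$ with $c_l = \sum_i a_{li}$, started from $f(x, y)$. The function names and the use of `ThreadPoolExecutor` for the parfor loop are likewise illustrative choices, and step-size control is omitted.

```python
from concurrent.futures import ThreadPoolExecutor

def stage_vector(f, x, y, A, h, V, l):
    """One activation of StageVector: stage l from the previous iterate V (assumed formula)."""
    c_l = sum(A[l])
    arg = [y_k + h * sum(A[l][i] * V[i][k] for i in range(len(V)))
           for k, y_k in enumerate(y)]
    return f(x + c_l * h, arg)

def it_rk_step(f, x, y, A, b, h, m, pool):
    """One time step: m sequential iterations, each with s independent stage computations."""
    s = len(b)
    V = [f(x, y) for _ in range(s)]  # InitializeStage (assumed start value)
    for _ in range(m):               # sequential loop over the iterations
        prev = V                     # every activation reads only the previous iterate
        V = list(pool.map(lambda l: stage_vector(f, x, y, A, h, prev, l), range(s)))
    # ComputeApprox: combine the stage vectors of the last iteration
    return [y_k + h * sum(b[l] * V[l][k] for l in range(s)) for k, y_k in enumerate(y)]

# Example: y' = -y with the one-stage implicit midpoint rule (A = [[1/2]], b = [1]).
f = lambda x, y: [-y[0]]
with ThreadPoolExecutor(max_workers=4) as pool:
    ynew = it_rk_step(f, 0.0, [1.0], [[0.5]], [1.0], 0.1, 10, pool)
print(ynew)
```

Because the `s` activations of one iteration only read the previous iterate, they need no synchronization among themselves, which is what the parfor operator expresses.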

Figure 3 shows a coordination program for extrapolation methods for solving ODEs. In each time step, different approximations are computed with different step-sizes, and these are then combined in an extrapolation table to obtain an approximation solution of higher order. In the figure, the task EulerStep performs the single micro-steps with different step-sizes. The micro-steps of one time step can be executed in parallel, expressed by a parfor operator.

External task declarations:

    BuildExtrapTable(IN Y: list[r] of vec(n); OUT y: vec(n))
    ComputeMicroStepsize(IN j: scal, H: scal, r: scal; OUT hj: scal)
    EulerStep(IN f: scal×vec(n)→vec(n), x: scal, y: vec(n), h: scal; OUT ynew: vec(n))
    StepsizeControl(IN y: vec(n), ynew: vec(n); OUT hnew: scal, xnew: scal)

Task definitions:

    ExtrapMethod(IN f: scal×vec(n)→vec(n), x: scal, xend: scal, y: vec(n), r: scal, H: scal;
                 OUT X: list[] of scal, Y: list[] of vec(n)) =
      while (x < xend)
        parfor (j=1,...,r) MicroSteps(j, f, x, y, H ; yj)
        ◦ BuildExtrapTable((y1,...,yr) ; ynew)
        ◦ StepsizeControl(y, ynew ; Hnew, xnew)

    MicroSteps(IN j: scal, f: scal×vec(n)→vec(n), x: scal, y: vec(n), H: scal; OUT ynew: vec(n)) =
      ComputeMicroStepsize(j, H, r ; hj)
      ◦ for (i=1,...,j) EulerStep(f, x, y, hj ; ynew)

Figure 3: Specification program of the extrapolation method.
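The structure of one time step of Figure 3 can be sketched in Python for a scalar ODE. Several details are assumptions not fixed by the figure: the micro step size is taken as $h_j = H/j$, the extrapolation table is built with the standard Aitken-Neville recurrence for a first-order base method, and the function names are hypothetical. Step-size control is omitted.

```python
import math
from concurrent.futures import ThreadPoolExecutor

def micro_steps(j, f, x, y, H):
    """MicroSteps: j explicit Euler steps with micro step size hj = H / j (assumed choice)."""
    hj = H / j
    for i in range(j):
        y = y + hj * f(x + i * hj, y)
    return y

def build_extrap_table(ys):
    """Aitken-Neville extrapolation for step numbers 1..r; returns the last table entry."""
    r = len(ys)
    row = list(ys)  # first column of the table
    for k in range(1, r):
        # go downwards so that row[j - 1] still holds the entry of the previous column
        for j in range(r - 1, k - 1, -1):
            ratio = (j + 1) / (j + 1 - k)  # n_j / n_{j-k} with step number n_j = j + 1
            row[j] = row[j] + (row[j] - row[j - 1]) / (ratio - 1)
    return row[-1]

def extrap_step(f, x, y, H, r, pool):
    """One time step: r independent MicroSteps tasks, then the extrapolation table."""
    ys = list(pool.map(lambda j: micro_steps(j, f, x, y, H), range(1, r + 1)))
    return build_extrap_table(ys)

f = lambda x, y: -y
with ThreadPoolExecutor(max_workers=4) as pool:
    ynew = extrap_step(f, 0.0, 1.0, 0.1, 4, pool)
print(ynew, math.exp(-0.1))
```

The `r` micro-step sequences are independent of each other and differ in length, which is why their distribution over cores matters for load balance.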

3 Software architectures of task-based programs

The execution of task-based programs requires support at several levels. For the execution of a single task, the task is mapped to a set of cores together with its internal variables. In addition, the corresponding external variables have to be made available. The coordination structure expresses the precedence relations between the tasks that must be met for a correct execution. The coordination structure does not specify an exact execution order of the tasks, but leaves some degree of freedom when tasks are independent of each other because the set of shared external variables $SV_{TT'} = EV_T \cap EV_{T'}$ is empty for two tasks $T$ and $T'$. In this case, $T$ and $T'$ can be executed at the same time on different, disjoint sets of cores if this is beneficial for the resulting execution time. This leads to a task scheduling problem: For a given coordination structure, how can the tasks be mapped to the cores such that a minimum overall execution time results? To solve the scheduling problem, a task scheduler is integrated into the execution environment. Thus, the specification of a task program provided for an application program is separated from the actual assignment of tasks to cores for execution.

3.1 Task scheduler

The task scheduler accepts correct task programs in the form of a specification $(\mathcal{T}, \mathcal{C})$. Depending on the coordination structure $\mathcal{C}$ and the state of the program execution, there are usually several tasks that can be executed next at each point of program execution. At program start, these are the tasks at the roots of the DAG representing $\mathcal{C}$. In later steps, a task is ready for execution when all tasks on which it depends are finished. The scheduler has knowledge about idle cores of the multicore platform and selects tasks for execution from the set of ready tasks. For the selection, the scheduler has several decisions to make. In particular, the number and the set of tasks that are executed next must be defined, and the number of cores used has to be determined for each of the tasks selected. This selection depends both on the number of cores that are available and on the size of the set of ready tasks. When only a small number of tasks is ready for execution, fewer tasks are selected, and each of these tasks gets a larger number of cores for execution. We refer to [24, 10, 16] for more details on appropriate scheduling algorithms.

[Figure 4 (diagram not reproduced). Recoverable labels: tasks $T_1, T_2, \ldots, T_n$; task assignment; thread groups; dynamic thread mapping; cores.]

Figure 4: Two-step scheduling system with task assignment to thread groups of different size and the dynamic mapping of threads to free cores of the execution platform.
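The selection step can be made concrete with a deliberately simple policy. The Python sketch below is a hypothetical illustration and not one of the scheduling algorithms of [24, 10, 16]: it determines the ready set from the predecessor sets and splits the idle cores as evenly as possible among the selected tasks; the parameter `min_cores` is an assumed lower bound on the cores given to a task.

```python
def ready_tasks(preds, finished, running):
    """Tasks that are not started yet and whose predecessors are all finished."""
    return sorted(t for t, ps in preds.items()
                  if t not in finished and t not in running and ps <= finished)

def select_and_allocate(ready, idle_cores, min_cores=1):
    """Select tasks from the ready set and split the idle cores among them.

    Hypothetical policy: at most idle_cores // min_cores tasks are selected and the
    cores are divided as evenly as possible, so fewer ready tasks get more cores each.
    """
    if not ready or idle_cores < min_cores:
        return {}
    k = min(len(ready), idle_cores // min_cores)
    base, extra = divmod(idle_cores, k)
    return {t: base + (1 if i < extra else 0) for i, t in enumerate(ready[:k])}

preds = {"T1": set(), "T2": set(), "T3": {"T1", "T2"}}
first = ready_tasks(preds, finished=set(), running=set())
print(select_and_allocate(first, idle_cores=16))
later = ready_tasks(preds, finished={"T1", "T2"}, running=set())
print(select_and_allocate(later, idle_cores=16))
```

With two ready tasks each one receives half of the cores, while a single ready task receives all of them, which matches the qualitative behavior described above.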

A single task is a multi-threaded piece of code that can be executed by a number of threads. These threads can either be created for each task on free cores, or the threads can be kept alive between the execution of different tasks. In the following, we assume the latter case. The assignment of the threads to cores is done by a thread scheduler. In summary, a two-level scheduling system results which is a suitable and flexible execution scheme for an adaptive execution of task-parallel programs, see Fig. 4.

The execution of task-based programs with parallel tasks is supported by a software environment containing several components working either statically or dynamically. Main parts of this software environment are the front-end with a correctness checker and a dynamically working back-end two-level scheduler as described before, see Fig. 5. After reading the coordination specification and the declarations of parallel tasks, the correctness checker checks important properties such as the correctness of the DAG or the correctness of the types and declarations of parallel tasks. The resulting intermediate representation is created statically and can then be handled dynamically by the scheduler. The scheduler requires feedback about the status of the execution platform concerning idle cores and about the status of the execution of tasks, e.g., whether the execution of a submitted task is finished. Based on this information, the scheduler assigns new tasks to groups of cores, and the internal thread scheduler handles the actual multi-threaded execution of a task on one set of cores.

[Figure 5 (diagram not reproduced). Recoverable labels: coordination specification; set of tasks $\mathcal{T}$ with external variables; checker; dependence task graph $\mathcal{C}$; scheduler; dynamic task assignment; status of task execution; status of execution platform; parallel multicore machine.]

Figure 5: Interaction between scheduler, execution platform and dependence checker for the construction of the set $SV_{TT'}$, and dependence analysis between tasks.
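The feedback loop between scheduler and execution platform can be sketched with `concurrent.futures` from the Python standard library. This is a minimal illustration under assumptions: `run_task_graph`, `preds`, and `bodies` are hypothetical names, each task body is an ordinary callable (which could itself be multi-threaded), and the worker pool stands in for the groups of cores.

```python
from concurrent.futures import ThreadPoolExecutor, wait, FIRST_COMPLETED

def run_task_graph(preds, bodies, max_parallel=4):
    """Execute bodies[t]() for every task t, respecting the predecessor sets preds[t]."""
    finished, order = set(), []
    pending = {}  # future -> task name, i.e. the tasks currently submitted
    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        while len(finished) < len(preds):
            started = set(pending.values())
            for t in sorted(preds):
                if t not in finished and t not in started and preds[t] <= finished:
                    pending[pool.submit(bodies[t])] = t
            if not pending:
                raise ValueError("no task is ready: cycle or unknown predecessor")
            # feedback from the platform: block until at least one task has finished
            done, _ = wait(list(pending), return_when=FIRST_COMPLETED)
            for fut in done:
                fut.result()  # re-raise an exception from the task body, if any
                t = pending.pop(fut)
                finished.add(t)
                order.append(t)
    return order

external = {}
preds = {"T1": set(), "T2": set(), "T3": {"T1", "T2"}}
bodies = {
    "T1": lambda: external.update(a=1),
    "T2": lambda: external.update(b=2),
    "T3": lambda: external.update(c=external["a"] + external["b"]),
}
order = run_task_graph(preds, bodies)
print(order, external["c"])
```

The completion order of `T1` and `T2` may vary between runs, but `T3` is only submitted after both have finished.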

In summary, the execution environment combines static compiler-based components for the translation of the specification into an intermediate format and dynamic parts for assigning parallel tasks to execution platforms at runtime. Both environments have been designed for parallel message-passing programs. The specific feature of the execution environment presented here is the two-level shared memory environment with internal and external variables. This gives rise to a two-level synchronization mechanism between threads on the one hand (supported by the thread scheduler in multi-threaded languages) and parallel tasks on the other hand.

4 Runtime experiments

Task-based executions not only facilitate programming, they are also able to provide competitive runtime results compared to traditional parallel programming techniques. This is especially useful for compute-intensive applications. We illustrate this for an application from numerical analysis, the iterated RK method from Section 2.4, on two execution platforms with different execution characteristics. The first platform is a Xeon cluster consisting of two nodes with two Intel Xeon E5345 "Clovertown" quad-core processors each. The processors run at 2.33 GHz, and the nodes are connected by an InfiniBand network. The second platform is the Chemnitz High Performance Linux (CHiC) cluster, which is built up of 538 nodes, each consisting of two AMD Opteron 2218 dual-core processors with a clock rate of 2.6 GHz.

Figure 6 (left) shows the execution times of one time step of an iterated RK method with four stage vectors on the Xeon cluster with 16 cores. As ODE system, a spatial discretization of the 2D Brusselator equation is used, which describes the reaction of two chemical substances with diffusion. In particular, the figure compares a traditional data parallel implementation with a task-based implementation using four parallel tasks. In this configuration, each parallel task is executed by four threads, which can be mapped to cores in different ways. For a consecutive mapping, the threads of a task are mapped to the cores of one processor. For a scattered mapping, cores of different processors are used.
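The difference between the two mappings can be written down as index arithmetic. The sketch below assumes a hypothetical core numbering in which the cores of one processor have neighbouring numbers (core c belongs to processor c // 4 on the 16-core Xeon platform); the function names are illustrative, and the mixed mappings of Figure 6 are not covered.

```python
def consecutive_mapping(n_tasks, threads_per_task):
    """Thread i of task t runs on core t * threads_per_task + i (neighbouring cores)."""
    return {t: [t * threads_per_task + i for i in range(threads_per_task)]
            for t in range(n_tasks)}

def scattered_mapping(n_tasks, threads_per_task):
    """Thread i of task t runs on core i * n_tasks + t (cores with stride n_tasks)."""
    return {t: [i * n_tasks + t for i in range(threads_per_task)]
            for t in range(n_tasks)}

def processors_used(cores, cores_per_processor=4):
    """Processors touched by a set of cores under the assumed numbering."""
    return sorted({c // cores_per_processor for c in cores})

cons = consecutive_mapping(4, 4)
scat = scattered_mapping(4, 4)
print(cons[1], processors_used(cons[1]))
print(scat[1], processors_used(scat[1]))
```

Under this numbering, a consecutively mapped task stays on one processor and can share its caches, whereas a scattered task places one thread on each of the four processors.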

Figure 6 (right) investigates the speedups achieved for different realizations on the CHiC cluster. The ODE system solved arises from a Galerkin approximation of a Schrödinger-Poisson system that models a collisionless electron plasma. In particular, the figure compares hybrid execution schemes, which use OpenMP for the parallel tasks within one node, with pure MPI execution schemes. For the data parallel version, much higher speedups are achieved by using the OpenMP programming model within the cluster nodes. This hybrid parallel version even outperforms the orthogonal program version with an optimized task mapping. The main source of this impressive improvement is the reduction of the number of MPI processes to 1/4. The best results are obtained by the program version with parallel tasks using OpenMP within the nodes.

[Figure 6 (plots not reproduced). Left panel: "IRK with RadauIIA7 for brusselator on Xeon (16 cores)", time per step in seconds over system size; curves: 1 M-task, 4 M-tasks ort scattered, 4 M-tasks ort mixed (d=2), 4 M-tasks ort mixed (d=4), 4 M-tasks ort consecutive. Right panel: "IRK-method with RadauIIA7 for schrödinger (n=128002) on CHiC", speedup over cores; curves: 1 M-task, 1 M-task OpenMP, 4 M-tasks ort, 4 M-tasks ort OpenMP.]

Figure 6: Left: Execution times of one time step of the IRK method using the four-stage RadauIIA7 method on the Xeon cluster for one M-task (data parallel) and four M-tasks (task parallel) with different mappings to cores. Right: Comparison of the speedups for pure MPI realizations with a hybrid MPI+OpenMP implementation of the IRK method for a Schrödinger equation with system size 128002 on the CHiC cluster.

5 Related work

Task-based approaches have been considered at several levels and with different programming models in mind. Language extensions have been proposed to express the execution of tasks, including Fortran M [11], Opus [6], Braid [25], Fx [21], HPF 2.0 [13], Orca [12], and Spar/Java [23]. The HPCS language proposals, Sun’s Fortress [2], IBM’s X10 [7], and Cray’s Chapel [5], also contain some support for the specification of tasks. Moreover, skeleton-based approaches like P3L [18], LLC [9], Lithium [1], and SBASCO [8], as well as library-based and coordination-based approaches have been proposed.

The compiler-based static specification of parallel tasks is supported by the TwoL approach [19], which transforms a task-based specification of a parallel program step-by-step into an executable parallel program based on MPI. Task-internal communication is also expressed by MPI. The dynamic creation and deployment of parallel MPI tasks is supported by the Tlib library [20]. Both approaches are in principle suited for multicore systems or clusters, but they rely on the fact that the MPI implementation uses the memory hierarchy of the execution platform efficiently.

The dynamic deployment of tasks is supported by several libraries. An example is the TPL library that has been developed for .NET [15], which supports the specification of tasks at loop level. Tasks arising from unstructured parallelism are supported via futures. All tasks are executed sequentially; the parallel execution of a single task is not supported. Load distribution is done through work stealing, see [4] for an analysis of this technique. Support for sequential tasks is also included in many other environments, including Cilk [3] and Charm++ [14]. Support for task-based execution is also available in OpenMP 3.0 [17], but tasks are executed sequentially by a single thread.

6 Conclusions

The specification of parallel programs as a set of tasks that can cooperate with other tasks via input-output relations is a useful abstraction, since it relieves the programmer from the need to specify many low-level details of the parallel execution. In particular, the programmer does not need to specify an exact mapping of the computations to threads or processes for execution. Instead, these computations are assigned to tasks according to their natural algorithmic decomposition. A runtime system then brings the tasks to execution by dynamically selecting tasks if free execution resources are available.

The use of tasks has been particularly useful for expressing irregular applications, including particle simulation methods or computer graphics computations like radiosity or ray tracing because of their dynamically evolving computation structure. In the traditional approach, single-processor tasks have been used that are executed by a single execution resource. The abstraction with tasks is also useful for programming multicore systems or multicore clusters. In this context, single-processor tasks can still be used, but it is more efficient and flexible to extend the approach to parallel tasks where a single task can be executed by multiple execution resources. This allows the mapping of a task to all cores or to a part of the cores of a node of a multicore system. The resources executing one task have to access a shared memory.

In this article, we have outlined this approach and have demonstrated its usefulness. In particular, we have shown how programs based on this approach can be expressed and how an execution environment can be organized. The use of parallel tasks leads to good execution times on large parallel systems.

References

[1] M. Aldinucci, M. Danelutto, and P. Teti. An advanced environment supporting structured parallel programming in Java. Future Generation Computer Systems, 19(5):611–626, 2003.

[2] E. Allen, D. Chase, J. Hallett, V. Luchangco, J.-W. Maessen, S. Ryu, G.L. Steele Jr., and S. Tobin-Hochstadt. The Fortress Language Specification, Version 1.0. Technical report, Sun Microsystems, Inc., March 2008.

[3] R.D. Blumofe, C.F. Joerg, B.C. Kuszmaul, C.E. Leiserson, K.H. Randall, and Y. Zhou. Cilk: An efficient multithreaded runtime system. Journal of Parallel and Distributed Computing, 37(1):55–69, 1996.

[4] R.D. Blumofe and C.E. Leiserson. Scheduling multithreaded computations by work stealing. J. ACM, 46(5):720–748, 1999.

[5] B.L. Chamberlain, D. Callahan, and H.P. Zima. Parallel Programmability and the Chapel Language. Int. J. High Perform. Comput. Appl., 21(3):291–312, 2007.

[6] B. Chapman, M. Haines, P. Mehrotra, H. Zima, and J. Van Rosendale. Opus: A coordination language for multidisciplinary applications. Sci. Program., 6(4):345–362, 1997.

[7] P. Charles, C. Grothoff, V. Saraswat, C. Donawa, A. Kielstra, K. Ebcioglu, C. von Praun, and V. Sarkar. X10: an object-oriented approach to non-uniform cluster computing. In OOPSLA ’05: Proc. of the 20th ACM Conf. on Object-oriented Programming, systems, languages, and applications, pages 519–538. ACM, 2005.

[8] M. Diaz, S. Romero, B. Rubio, E. Soler, and J. M. Troya. An Aspect Oriented Framework for Scientific Component Development. In Proc. of the 13th Euromicro Conf. on Parallel, Distributed and Network-Based Processing (PDP 2005), pages 290–296. IEEE, 2005.

[9] A. J. Dorta, J. A. González, C. Rodríguez, and F. de Sande. llc: A Parallel Skeletal Language. Parallel Processing Letters, 13(3):437–448, 2003.

[10] P.-F. Dutot, T. N’Takpé, F. Suter, and H. Casanova. Scheduling Parallel Task Graphs on (Almost) Homogeneous Multicluster Platforms. IEEE Transactions on Parallel and Distributed Systems, 20(7):940–952, 2009.

[11] I. T. Foster and K. M. Chandy. Fortran M: A Language for Modular Parallel Programming. J. Parallel Distrib. Comput., 26(1):24–35, 1995.

[12] S. Ben Hassen, H. E. Bal, and C. J. H. Jacobs. A task- and data-parallel programming language based on shared objects. ACM Transactions on Programming Languages and Systems (TOPLAS), 20(6):1131–1170, 1998.

[13] High Performance Fortran Forum. High Performance Fortran Language Specification 2.0. Technical report, Center for Research on Parallel Computation, Rice University, 1997.

[14] L.V. Kale, E. Bohm, C.L. Mendes, T. Wilmarth, and G. Zheng. Programming Petascale Applications with Charm++ and AMPI. In D. Bader, editor, Petascale Computing: Algorithms and Applications, pages 421–441. Chapman & Hall / CRC Press, 2008.

[15] D. Leijen, W. Schulte, and S. Burckhardt. The design of a task parallel library. In OOPSLA ’09: Proceeding of the 24th ACM SIGPLAN conference on Object oriented programming systems languages and applications, pages 227–242, New York, NY, USA, 2009. ACM.

[16] T. N’Takpé, F. Suter, and H. Casanova. A Comparison of Scheduling Approaches for Mixed-Parallel Applications on Heterogeneous Platforms. In Proc. of the 6th Int. Symp. on Par. and Distrib. Comp. IEEE, 2007.

[17] OpenMP Application Program Interface, Version 3.0. www.openmp.org, May 2008.

[18] S. Pelagatti. Task and data parallelism in P3L. In F. A. Rabhi and S. Gorlatch, editors, Patterns and skeletons for parallel and distributed computing, pages 155–186. Springer-Verlag, London, UK, 2003.

[19] T. Rauber and G. Rünger. A Transformation Approach to Derive Efficient Parallel Implementations. IEEE Transactions on Software Engineering, 26(4):315–339, 2000.

[20] T. Rauber and G. Rünger. Tlib - A Library to Support Programming with Hierarchical Multi-processor Tasks. Journ. of Parallel and Distrib. Comput., 65(3):347–360, 2005.

[21] J. Subhlok and B. Yang. A new model for integrated nested task and data parallel programming. In Proc. of the 6th ACM SIGPLAN symposium on Principles and practice of parallel programming, pages 1–12. ACM Press, 1997.

[22] H. Sutter and J. Larus. Software and the Concurrency Revolution. ACM Queue, 3(7):54–62, 2005.

[23] C. van Reeuwijk, F. Kuijlman, and H.J. Sips. Spar: a Set of Extensions to Java for Scientific Computation. Concurrency and Computation: Practice and Experience, 15:277–299, 2003.

[24] N. Vydyanathan, S. Krishnamoorthy, G.M. Sabin, U.V. Catalyurek, T. Kurc, P. Sadayappan, and J.H. Saltz. An Integrated Approach to Locality-Conscious Processor Allocation and Scheduling of Mixed-Parallel Applications. IEEE Transactions on Parallel and Distributed Systems, 20(8):1158–1172, 2009.

[25] E.A. West and A.S. Grimshaw. Braid: integrating task and data parallelism. In FRONTIERS ’95: Proceedings of the Fifth Symposium on the Frontiers of Massively Parallel Computation (Frontiers’95), page 211. IEEE Computer Society, 1995.