Underlying optimization problems - Systematic Design Space Exploration of Dynamic Dataflow Prog

The design space exploration problem consists of partitioning, scheduling and buffer dimensioning. These subproblems are illustrated in Fig. 6.1. The solutions (configurations) applied to these subproblems lead to defining a fixed execution order of the firings that compose the execution and hence restrict the topological order in the execution trace graph by introducing the configuration-related dependencies, as described in Section 4.2.2.

Defining such an execution order for the case of an AT S program is much more complex than for the case of K P N or DP N programs. For a K P N program the execution order is defined directly for the set of processes composing the program [84]. For a DP N program it is a sequence of actor firings the execution order has to be defined for [147]. The case of AT S programs requires defining the execution order for a sequence of atomic steps, where each step consists of an execution of a firing function controlled not only by the availability of input tokens, but also by state variables and priorities, as summarized in Section 2.1.4.

Figure 6.1 – The partitioning, scheduling and buffer dimensioning subproblems.

6.2.1 Partitioning

Using the terminology coming from the production field, the problem is to assign n jobs (corresponding to the action firings) to m parallel machines (also referred to as partitions or processing units). Each job j has an associated processing time (or an action weight) pjand it belongs to a group (or an actor) gj. There are l possible groups, denoted as g1, g2, . . . , gk. If

g2= 3, it means that job 2 belongs to group g3. Each group gjcan be divided into subgroups, where all jobs have the same processing times and thus can be identified with different executions of the same action. Between some pairs { j , j0} of incompatible jobs (i .e., with

gj 6= gj0) is associated a communication time w_{j j}0. It is subject to a fixed quantity q_{j j}0 of

information (or the number of tokens) that needs to be transferred. The size of this data is fixed for any subgroup.

The partitioning problem can be represented by an acyclic directed graph G = (V, A), with the vertices (or nodes) set V and the arcs set A. Each vertex or node j represents a job and each arc ( j , j0) (between two nodes of the same group or not) represents a precedence constraint. With each arc ( j , j0) such that gj6= g0_j is associated a communication time (or a communication

weight) wj j0. With each arc ( j , j0) such that g_j= g_j0, no weight or cost is associated. It can be

observed that the relative order of execution of the nodes belonging to the same group is quite constrained (i .e., the decision space is very restricted for such arcs) and is, in fact, imposed by the input stimulus used to build the graph. Finally, a group constraint can be defined, which implies that all jobs belonging to the same group have to be processed on the same machine. In other words, all executions of a particular actor must be partitioned to the same machine.

6.2.2 Scheduling

Typically, only one job can be executed at a time on one machine. Hence, for each assignment of jobs to the machines, the order of execution (i .e., the sequencing) of jobs S =© j, j0_{, . . .ª must}

be decided in each machine. In practice, the sequencing of jobs within one group is quite constrained. Therefore, it can be reasonably assumed that there is almost no impact on the resulting makespan if some permutations are performed in the sequencing of each group. In this case, the scheduling problem can be limited to choosing at each step a group for which a node should be executed. The eligibility for execution for each node is determined by the availability of the necessary input tokens and spaces in the outgoing buffers. Depending on the internal nature of each actor (i .e., static or dynamic) and on the underlying structure of its nodes, an optimal static order may or may not exist. In the situation where there are several jobs available for an execution on one machine, the selection of one of them is very sensitive to the solution’s quality.

The sequencing must take into account two constraints. The precedence constraint ( j , j0₎

means that the job j (plus the associated communication time) must be completed before the job j0is allowed to start at the time point Tst ar t( j0). The setup constraint requires that for each existing arc ( j , j0_{) involving nodes from different groups, a setup (or communication) time}

wj j0is occurring. In contrast with the job scheduling literature [148], one can observe that a

setup time also occurs if the involved jobs are assigned to two different machines.

6.2.3 Buffer dimensioning

Each communication channel i (buffer in the network) that the information (tokens) is being transmitted through is bounded by Bi, so that a configuration B (bi) = Biis specified. More precisely, the sum of the qj j0’s along any arc assigned to a particular buffer i cannot exceed B_i.

because if the necessary space is not available, this job cannot be executed (i .e., jS6= j ). The optimal size should be set independently for each buffer in the network so that the delays of job executions arising from unavailability of the space in the buffers are minimized.

6.3 Target platforms

6.3.1 Homogeneous platforms

In the case of a homogeneous platform, each job j must be performed on any of the m parallel identical processing units (machines). The associated processing time pj(action weight on the platform) is thus the same on each processing unit. A group riis assigned to each machine

i . This definition remains consistent with the construction of the Non-Uniform-Memory- Access (NUMA) architectures, where the processing units (cores) are grouped and connected

to different memory banks [149], as depicted in Fig. 6.2a for the case of Intel Xeon X5650 (example of a NUMA architecture) considered in this work as a homogeneous platform. The communication time wj j0 required for transferring a given number of tokens consists

of the product of two elements: the (previously described) quantity qj j0 and the variable

time wj j0(h( j ), h( j0)) needed to transfer a single unit of information (where h( j ) denotes the

machine the job j is performed on). The value of wj j0can belong to one of the three cases:

(1) j and j0are scheduled on the same machine (communication via the L1-L3 caches or the local memory); (2) j and j0 are scheduled on machines of the same group (L3 cache or local memory); (3) j and j0are scheduled on machines from different groups (remote memory) [132]. It can be assumed that case (3) will always introduce much higher values of wj j0 than cases (1) and (2), whereas case (2) is likely to have a higher latency than case

(1). This will depend, however, on the communication demands of the actors in a specific partitioning configuration. Figure 6.2b presents a sample assignment of four groups (actors) to a set of homogeneous machines corresponding to the mentioned earlier example of the

NUMA architecture.

6.3.2 Heterogeneous platforms

Heterogeneous platforms imply considering different families of processing units. In each family, all processing units (machines) are identical, but the processing units of one family are not necessarily faster than the processing units of another. Typically, there are two families:

HW (for hardware) and SW (for software). If only family SW is considered, the extended

problem is reduced to the problem of homogeneous machines described previously. Other- wise, a job j has the same processing time on all the SW processing units (denoted pj(SW )), and another processing time on all the HW processing units (denoted pj(HW )). The actors

(a) Platform example: Intel Xeon X5650.

(b) Sample assignment.

assigned to the HW processing units work in parallel. Thus, the scheduling problem for a HW subset of machines is eliminated without having any impact on the makespan. Figure 6.3a illustrates the basic construction of Xilinx Zynq 7000, which is an example of a heterogeneous platform. The two ARM machines, along with the associated memories (L1, L2, DRAM) belong to the SW family, whereas the AXI Masters component denotes the set of HW processing units. Handling heterogeneous platforms introduces different figures of merit for the communica- tion time. The wj j0’s are all equal to 1 (small value) if the involved groups are assigned to HW

(assuming internal communications for hardware modules). The wj j0’s are represented with

different levels of values (depending on the assignment to the processing units, similar to the homogeneous platform case) if the involved groups are assigned toSW . If one group is

executed onHW and the other on SW , wj j0 depends on the amount of information to be

sent.

The constraints introduced in this case are mostly subject to the HW family. First of all, an

eligibility constraint occurs, meaning that a HW processing unit cannot perform all the jobs

(i .e., there is a set d (i ) of unsupported operations for each processing unit i ), for instance the floating point operations. Furthermore, there is a capacity constraint due to the limited memory size of the HW family. Each group g (buffer b) has a memory requirement of mem(g ) (mem(b)), respectively. The sum of mem(g ) and mem(b) for all groups and buffers assigned to the HW processing units cannot exceed a given limit. Finally, the buffer mapping constraint implies that the number of possible connection paths cnp between SW and HW (HP compo- nents in the aforementioned example) is fixed to H Pmax. The maximal number of buffers that can be mapped to one connection path is fixed to Bmax.

Figure 6.3b presents a sample assignment of four groups (actors) to the, mentioned previously, example of a heterogeneous platform.

In document Systematic Design Space Exploration of Dynamic Dataflow Programs for Multi-core Platforms (Page 109-113)