Joint Process and Channel Mapping: The GBM Algorithm

6.4 Best-Effort Mapping and Scheduling

6.4.5 Joint Process and Channel Mapping: The GBM Algorithm

The mapping heuristics described in the previous section share similarities with commonly used algorithms. However, as discussed in Section 1.1.1, new communication architectures have made channel mapping as important as process mapping. This section presents an algorithm that maps KPN processes and channels in no predefined order; the order depends instead on the application and the target platform.

The goal of a joint mapping process is to compute both $\mu_p$ and $\mu_c$ simultaneously, obtaining a valid KPN mapping with minimum makespan. To solve this problem, a Group-Based Mapping (GBM) algorithm is proposed with two main underlying goals: (1) narrow the mapping space while avoiding early selection of specific HW resources, and (2) analyze processes and channels jointly, in no predefined order.

To achieve the first goal, the algorithm was split into two phases. In the first one, KPN elements are iteratively mapped to groups of HW resources, i.e., assignment sets from Definition 6.6. This reflects the fact that different resources may display the same timing characteristics for a given KPN element. In the second phase, a form of homogeneous mapping is performed in which the actual HW resources are selected. To achieve the second goal, the algorithm was designed so that processes and channels are selected according to an improvement measure. This measure is not biased toward processes or channels, so they are selected in no prescribed order.

The pseudocode of the GBM heuristic is shown in Algorithm 6.2. As mentioned before, the first phase in Lines 2–12 works on the assignment sets of the KPN elements ($M^e$). These sets are reduced iteratively by calls to the function MakeProposal until no more reductions are possible. The function MakeProposal selects a KPN element $e^*$ and reduces its set from $M^{e^*}$ to $M^*$. The first call to Assess in Line 5 checks whether the reduction is feasible. If it is not, the proposed new set $M^*$ is removed from the previous set $M^{e^*}$ and the feasibility is checked anew (second call to Assess in Line 7). At the end of the first phase, every KPN element $e$ has an optimized assignment set $M^e$. The second phase then performs homogeneous mapping on these groups (see Line 13). Further details of the GBM algorithm are provided in the following sections.

6.4.5.1 Making a Proposal

In this context, making a proposal refers to the process of selecting a KPN element and refining its assignment set. The algorithm for making proposals uses the trace graph $TG^A$ introduced in Section 6.4.1. Since the buffer sizing process has already taken place, the trace graph is acyclic. Proposals are generated by analyzing the Dominant Sequence [224] of the trace DAG, i.e., the critical path of the partially mapped graph.

116 Chapter 6. Parallel Code Flow

Algorithm 6.2 Group-Based Algorithm.

 1: procedure GBM($M^A = (\mathcal{PAE}^A, KPN^A = (\mathcal{P}^A, \mathcal{C}^A, var\_size))$, $SOC = (\mathcal{PE}, \mathcal{CP})$, $\beta$)
 2:     $\forall P^A \in \mathcal{P}^A$, $M^{P^A} \leftarrow \mathcal{PE}$;  $\forall C^A \in \mathcal{C}^A$, $M^{C^A} \leftarrow \mathcal{CP}$
 3:     while CanMakeProposals($\{(e \in \mathcal{PAE}^A, M^e)\}$) do
 4:         $(e^*, M^*) \leftarrow$ MakeProposal($\{(e \in \mathcal{PAE}^A, M^e)\}$)
 5:         $(f, \{(e, M'^e)\}) \leftarrow$ Assess($\{(e \in \mathcal{PAE}^A, M^e)\}$, $(e^*, M^*)$)
 6:         if $f$ = False then
 7:             $(f, \{(e, M'^e)\}) \leftarrow$ Assess($\{(e \in \mathcal{PAE}^A, M^e)\}$, $(e^*, M^{e^*} \setminus M^*)$)
 8:             if $f$ = False then fatal error; return $(\emptyset, \emptyset)$
 9:             end if
10:         end if
11:         $\forall e \in \mathcal{PAE}^A$, $M^e \leftarrow M'^e$
12:     end while
13:     return DoHomogeneousMapping($\{(e \in \mathcal{PAE}^A, M^e)\}$)    ▷ returns the KPN mapping $(\mu_p, \mu_c)$
14: end procedure

15: procedure Assess($\{(e, M^e)\}$, $(e^*, M^*)$)
16:     $\{(e, M'^e)\} \leftarrow$ Propagate($\{(e \in \mathcal{PAE}^A, M^e)\}$, $(e^*, M^*)$)
17:     if $\exists e \in \mathcal{PAE}^A$, $M'^e = \emptyset$ then return (False, $\emptyset$)
18:     end if
19:     $f \leftarrow$ LoadControl($\{(e \in \mathcal{PAE}^A, M'^e)\}$)
20:     if $f$ = False then return (False, $\emptyset$)
21:     end if
22:     if $e^* \in \mathcal{C}^A$ then    ▷ extra check for communication channels
23:         $\{(e, M'^e)\} \leftarrow$ ConsistCheck($\{(e \in \mathcal{PAE}^A, M'^e)\}$, $e^*$)
24:         if $\exists e \in \mathcal{PAE}^A$, $M'^e = \emptyset$ then return (False, $\emptyset$)
25:         end if
26:     end if
27:     return (True, $\{(e, M'^e)\}$)
28: end procedure

At every call to MakeProposal, the critical path of the trace DAG is determined. This is done by using the timing provided by the slowest resource in the assignment set $M^e$, $e \in \mathcal{PAE}^A$. The algorithm compares the impact of alternative assignments for the nodes in the critical path, i.e., mappings that are better than the slowest resource. The KPN element corresponding to the trace graph nodes for which this impact is the highest is selected. The function then returns this element together with its assignment set reduced to the resource group that produced the highest improvement. Once all the elements in the critical path have been assigned to a group that contains only resources of the same type, no more proposals can be made. In this case, a call to the function CanMakeProposals returns false (see Line 3 in Algorithm 6.2).
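The selection step can be illustrated with a small Python sketch. It makes simplifying assumptions that are not part of the original description: every KPN element corresponds to exactly one DAG node, communication costs are ignored, and `timing` is a hypothetical table giving the cost of each node on each resource group. The critical path is computed with the slowest option per node, and the (node, group) pair that shortens the longest path the most is returned.

```python
from graphlib import TopologicalSorter


def critical_path(preds, cost):
    """Longest path in a DAG; preds maps node -> set of predecessor nodes."""
    finish, parent = {}, {}
    for n in TopologicalSorter(preds).static_order():
        best = max(preds.get(n, ()), key=lambda p: finish[p], default=None)
        parent[n] = best
        finish[n] = cost[n] + (finish[best] if best is not None else 0)
    node = max(finish, key=finish.get)
    length, path = finish[node], []
    while node is not None:
        path.append(node)
        node = parent[node]
    return path[::-1], length


def make_proposal(preds, timing):
    """timing: node -> {resource group: time}. Returns (node, group, gain)."""
    worst = {n: max(t.values()) for n, t in timing.items()}
    path, base = critical_path(preds, worst)
    best = None
    for n in path:
        for group, t in timing[n].items():
            # Re-evaluate the longest path with node n moved to this group.
            gain = base - critical_path(preds, {**worst, n: t})[1]
            if best is None or gain > best[2]:
                best = (n, group, gain)
    return best


preds = {"b": {"a"}, "c": {"a"}, "d": {"b", "c"}}
timing = {
    "a": {"risc": 4, "dsp": 2},
    "b": {"risc": 10, "dsp": 3},
    "c": {"risc": 5, "dsp": 5},
    "d": {"risc": 1, "dsp": 1},
}
proposal = make_proposal(preds, timing)
```

Here the worst-case critical path is a, b, d with length 15. Moving `b` to the `dsp` group shortens the longest path to 10 (now through `c`), which is the largest gain, so `b` is the element proposed.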

6.4.5.2 Mapping Propagation

After a proposal has been made, the function Propagate in Line 16 of Algorithm 6.2 updates all the assignment sets accordingly. Note that if the assignment set of a KPN element is reduced, the assignment sets of other elements must be updated. As an example, consider a process whose assignment set is reduced from the set of all processors ($M^P = \mathcal{PE}$) to a single processor ($M'^P = \{PE_i\}$). As a consequence, the assignment sets of all ingoing and outgoing communication channels must be reduced to the communication primitives that have $PE_i$ as destination and source, respectively. Again, as an example, consider a communication channel $C$ whose assignment set is reduced to a single primitive ($M^C = \{CP_{ij}\}$) over a HW FIFO. In such an extreme case, the assignment sets of the source and destination processes become $M^{src(C)} = \{PE_i\}$ and $M^{dst(C)} = \{PE_j\}$. As can be seen, the update process has to be propagated throughout the application graph. This process continues recursively until no more changes in the assignment sets are observed.
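The propagation can be sketched as a fixed-point iteration. The encoding below is an assumption made for illustration only: each communication primitive is represented as a pair `(src_pe, dst_pe)`, process assignment sets are sets of PE names, and `chan_ends` maps every channel to its producer and consumer process. Self-loop channels are not handled.

```python
def propagate(proc_sets, chan_sets, chan_ends):
    """Shrink assignment sets until nothing changes.

    proc_sets: process -> set of PEs
    chan_sets: channel -> set of (src_pe, dst_pe) primitives
    chan_ends: channel -> (producer process, consumer process)
    """
    procs = {p: set(s) for p, s in proc_sets.items()}
    chans = {c: set(s) for c, s in chan_sets.items()}
    changed = True
    while changed:
        changed = False
        for c, (src, dst) in chan_ends.items():
            # Keep only primitives whose endpoints are still allowed.
            keep = {(a, b) for (a, b) in chans[c]
                    if a in procs[src] and b in procs[dst]}
            # Keep only PEs that some remaining primitive can serve.
            new_src = procs[src] & {a for a, _ in keep}
            new_dst = procs[dst] & {b for _, b in keep}
            if (keep, new_src, new_dst) != (chans[c], procs[src], procs[dst]):
                chans[c], procs[src], procs[dst] = keep, new_src, new_dst
                changed = True
    return procs, chans


all_pes = {"pe0", "pe1", "pe2"}
procs, chans = propagate(
    {"P1": all_pes, "P2": all_pes, "P3": all_pes},
    {"c1": {("pe0", "pe1")},  # already reduced to a single HW FIFO
     "c2": {("pe0", "pe2"), ("pe1", "pe2"), ("pe1", "pe0")}},
    {"c1": ("P1", "P2"), "c2": ("P2", "P3")},
)
```

Fixing `c1` to the FIFO from `pe0` to `pe1` pins `P1` and `P2`, which in turn removes the primitive `("pe0", "pe2")` from `c2`. An empty set in the result would correspond to the infeasibility test in Line 17.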

6.4.5.3 Load Control

The function LoadControl in Line 19 of Algorithm 6.2 prevents proposals from always selecting the same group of resources. For this purpose, a measure of the occupation of a group has to be defined. This measure has to take into account that the bigger the KPN elements are, the more difficult it is to distribute them within a resource group. Consider for example an accumulated total memory requirement of 100 kB to be distributed on 3 memories of 40 kB each. While it is easy to distribute 20 small channels of 5 kB among the group, it is impossible to do it for 4 bigger channels of 25 kB. A similar observation holds for processes, where the size relates to the computation time. In the case of HW FIFOs, the function checks that there are no more KPN channels assigned than HW FIFOs in the group.
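The memory example can be checked with a few lines of Python. The sketch uses first-fit decreasing, a standard bin packing heuristic chosen here for illustration; it is not claimed to be what LoadControl does, and as a heuristic it may reject some instances that are in fact feasible.

```python
def fits_first_fit_decreasing(sizes, capacities):
    """Try to place every item into some bin, largest items first."""
    free = list(capacities)
    for size in sorted(sizes, reverse=True):
        for i, room in enumerate(free):
            if size <= room:
                free[i] -= size
                break
        else:
            return False  # no bin has room for this item
    return True


memories_kb = [40, 40, 40]
small_ok = fits_first_fit_decreasing([5] * 20, memories_kb)  # 20 channels of 5 kB
large_ok = fits_first_fit_decreasing([25] * 4, memories_kb)  # 4 channels of 25 kB
```

Both cases require 100 kB in total, but only the first can be placed: a 40 kB memory holds a single 25 kB channel, so the fourth large channel has nowhere to go.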

Consider a group of resources $G \subset (\mathcal{PE} \cup \mathcal{CR})$ with associated mapping set $M_G \subset \mathcal{PAE}^A$, i.e., a set of processes or channels mapped to the group $G$ (see Definition 6.7). Generally speaking, if $|M_G| > |G|$, the load control ensures that the utilization of the group $U$ stays below a variable threshold:

$$U < k_x \cdot \frac{|M_G|}{|M_G| + |G| \cdot (k_x/k_y - 1)} \qquad (6.3)$$

The threshold is controlled by the number of KPN elements and the number of resources. The parameters $k_x \in (0, 1)$, $k_y \in (0, k_x)$ provide further control. In the extreme cases $|M_G| = |G|$ and $|M_G| \gg |G|$, the utilization is compared against $k_y$ and $k_x$, respectively. The values of the parameters were determined empirically. For memories, they are $k_x = 1$, $k_y = 0.5$, and for processors $k_x = 0.95$, $k_y = 0.75$.
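The behavior of the threshold in Equation 6.3 is easy to verify numerically. The following sketch uses hypothetical function names, and the utilization of 100/120 is taken from the memory example given earlier (100 kB required, three memories of 40 kB).

```python
def load_threshold(k_x, k_y, n_elements, n_resources):
    """Right-hand side of Equation 6.3."""
    return k_x * n_elements / (n_elements + n_resources * (k_x / k_y - 1))


def passes_load_control(utilization, k_x, k_y, n_elements, n_resources):
    return utilization < load_threshold(k_x, k_y, n_elements, n_resources)


# Extreme cases with the processor parameters k_x = 0.95, k_y = 0.75.
equal_case = load_threshold(0.95, 0.75, 4, 4)      # |M_G| = |G|, tends to k_y
large_case = load_threshold(0.95, 0.75, 10**6, 1)  # |M_G| >> |G|, tends to k_x

# Memory parameters k_x = 1, k_y = 0.5 applied to the 100 kB example.
u = 100 / 120
many_small = passes_load_control(u, 1.0, 0.5, 20, 3)  # threshold 20/23
few_large = passes_load_control(u, 1.0, 0.5, 4, 3)    # threshold 4/7
```

With the same utilization, 20 small channels pass the check while 4 large ones do not, which matches the intent that larger elements are harder to distribute within a group.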

The utilization $U$ for a group of memories is determined by the ratio of the required and the available memory, i.e.,

$$U_{mem} = \frac{\sum_{C \in M_G} \beta(C) \cdot var\_size(C)}{\sum_{CR \in G} x^{CR}_{MEM}} \qquad (6.4)$$

For processors, the definition is more involved, since the available time is not as well defined as the available memory. The LoadControl module uses an estimation of the makespan $T$ to model the available time. This estimation is obtained by performing list scheduling on the trace DAG. With this, the utilization of a group of processors is defined as the ratio of the required computation (cumulative segment costs) and the total available computing time, determined by $T$ cycles on each of the processors of the group. More precisely, for a group $G$ of processors of type $PT$,

$$U_{proc} = \frac{\sum_{P \in M_G} \zeta^{PT}_{trace}(P)}{|G| \cdot T} \qquad (6.5)$$
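Both utilization measures reduce to simple ratios, as the following sketch shows. The input encoding is an assumption: each channel is given as a pair of buffer size in tokens and token size in bytes, standing in for $\beta(C)$ and $var\_size(C)$, and each process is given by its cumulative trace cost on the processor type of the group. The processor formula divides by $T$ cycles on each processor of the group, following the verbal definition in the text.

```python
def memory_utilization(channels, memory_sizes):
    """Required over available memory for a group of memories (Equation 6.4).

    channels: iterable of (beta, var_size) pairs, one per mapped channel.
    memory_sizes: size of every memory in the group.
    """
    required = sum(beta * var_size for beta, var_size in channels)
    return required / sum(memory_sizes)


def processor_utilization(trace_costs, n_processors, makespan_estimate):
    """Required computation over T cycles on each processor of the group."""
    return sum(trace_costs) / (n_processors * makespan_estimate)


# 20 channels with 5 tokens of 1 kB each on three memories of 40 kB.
u_mem = memory_utilization([(5, 1024)] * 20, [40 * 1024] * 3)
# Three processes on a group of two processors, estimated makespan of 1000 cycles.
u_proc = processor_utilization([300, 500, 400], n_processors=2, makespan_estimate=1000)
```

The resulting values would then be compared against the variable threshold of Equation 6.3.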


6.4.5.4 Consistency Check

When a KPN channel is assigned to a group of CRs, inconsistencies may appear. This is controlled by the function ConsistCheck in Line 23 of Algorithm 6.2. Consider a group of HW FIFOs $G_{fifo}$ that sparsely connect processors within a group $G_{proc}$. In this case, it is not enough to perform local checks, i.e., to check whether all producers and consumers of the KPN channels mapped to $G_{fifo}$ are mapped to $G_{proc}$. It has to be further checked that there actually exists at least one fixed feasible mapping.
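Why local checks are insufficient can be seen in a small brute-force sketch. It is a hypothetical illustration, not the ConsistCheck procedure itself: FIFOs are encoded as directed `(src_pe, dst_pe)` pairs, each FIFO serves at most one channel, and all fixed process mappings are enumerated, which is exponential and only practical for tiny groups.

```python
from collections import Counter
from itertools import product


def exists_fixed_mapping(proc_sets, chan_ends, fifos):
    """Search for one process mapping under which every channel gets its own FIFO.

    proc_sets: process -> set of candidate PEs
    chan_ends: channel -> (producer process, consumer process)
    fifos: list of (src_pe, dst_pe) pairs, one entry per HW FIFO
    """
    names = sorted(proc_sets)
    available = Counter(fifos)
    for choice in product(*(sorted(proc_sets[p]) for p in names)):
        pe_of = dict(zip(names, choice))
        needed = Counter((pe_of[s], pe_of[d]) for s, d in chan_ends.values())
        if all(available[pair] >= n for pair, n in needed.items()):
            return True
    return False


pes = {"pe0", "pe1", "pe2"}
proc_sets = {"P1": pes, "P2": pes, "P3": pes}
chan_ends = {"c1": ("P1", "P2"), "c2": ("P2", "P3"), "c3": ("P1", "P3")}
ring = [("pe0", "pe1"), ("pe1", "pe2"), ("pe2", "pe0")]

ring_ok = exists_fixed_mapping(proc_sets, chan_ends, ring)
extra_ok = exists_fixed_mapping(proc_sets, chan_ends, ring + [("pe0", "pe2")])
```

In the ring topology every processor is both a source and a destination of some FIFO, so a local check would pass, yet no single mapping serves all three channels. Adding one FIFO from `pe0` to `pe2` makes the mapping P1 on `pe0`, P2 on `pe1`, P3 on `pe2` feasible.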

6.4.5.5 Homogeneous Mapping

After the heterogeneous phase, the assignment sets of the nodes in the critical path of the trace DAG consist only of resources of the same type. For these nodes, the problem is reduced to multiple smaller homogeneous mapping problems, where the main concern is to decide how to share hardware resources. Since sharing processors usually has a higher impact on the runtime than sharing communication resources (e.g., memories), this phase focuses on process mapping. Channel mapping can be done, in this case, as an afterthought, as long as the assignment sets are correct. The same holds for the KPN elements outside the critical path.

Any homogeneous mapping algorithm can be used in this phase, such as those derived from the bin packing algorithm. The function in Line 13 of Algorithm 6.2 uses the information collected in the previous phase to compute the final mapping. First, note that due to constraints imposed by channel assignment sets, mapping a process $P^A \in \mathcal{P}^A$ will sometimes imply mapping a group of processes $G^{P^A} \subseteq \mathcal{P}^A$. Processes are sorted primarily by the size of these induced sets ($|G^{P^A}|$) and then by the size of the assignment sets ($|M^{P^A}|$). The bigger the induced group and the smaller the assignment set, the earlier the process has to be mapped. The list of processes in the critical path of the trace DAG is sorted according to these criteria. Every process $P^*$ is mapped to the processor $PE^*$ on which a local scheduler on the trace DAG reports the best finishing time. In order to reduce the complexity of this mapping phase, the local scheduler considers only segments of processes that are already mapped to the target processor $PE^*$ and ignores other dependencies. More precisely, let $ALAP(S_i^{P^*})$ be the As Late As Possible time of segment $S_i^{P^*}$, obtained during the critical path computation on the trace DAG. Let $t^{PE}_{S_i^{P^*}}$ be the time computed by the local list scheduler for the same segment on every processor in the assignment set, i.e., $PE \in M^{P^*}$. The final mapping is determined by $\mu_p^{gbm}(P^*) = PE^*$, with

$$PE^* = \underset{PE \in M^{P^*}}{\operatorname{argmin}} \sum_{S_i^{P^*} \in T^{P^*}} \max\left(0,\; t^{PE}_{S_i^{P^*}} - ALAP(S_i^{P^*})\right) \qquad (6.6)$$

The term $t^{PE}_{S_i^{P^*}} - ALAP(S_i^{P^*})$ can be seen as a penalty for processors that do not have free time slots in which the segment can be scheduled before its theoretical ALAP time. After every fixed mapping decision, the tests described in Sections 6.4.5.2–6.4.5.4 are performed.
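The ordering rule and the selection in Equation 6.6 can be sketched as follows. The data layout is assumed for illustration: `scheduled` maps every candidate processor to the times reported by the local list scheduler for the segments of the process, in the same order as the list of ALAP times. The local scheduler itself is not modeled.

```python
def mapping_order(induced_groups, assignment_sets):
    """Bigger induced group first, then smaller assignment set first."""
    return sorted(
        induced_groups,
        key=lambda p: (-len(induced_groups[p]), len(assignment_sets[p])),
    )


def select_processor(scheduled, alap):
    """Pick the processor with the smallest summed lateness (Equation 6.6).

    scheduled: PE -> list of scheduled times, one per segment
    alap: list of ALAP times for the same segments
    """
    def penalty(pe):
        return sum(max(0, t - a) for t, a in zip(scheduled[pe], alap))
    return min(scheduled, key=penalty)


order = mapping_order(
    {"P1": {"P1"}, "P2": {"P2", "P3"}, "P4": {"P4"}},
    {"P1": {"pe0", "pe1"}, "P2": {"pe0", "pe1"}, "P4": {"pe0"}},
)
chosen = select_processor(
    {"pe0": [8, 25, 31], "pe1": [12, 18, 29]},
    alap=[10, 20, 30],
)
```

In the example, `pe0` accumulates a penalty of 6 (segments finishing 5 and 1 time units after their ALAP times) and `pe1` a penalty of 2, so `pe1` is selected. Ties are broken by dictionary order in this sketch, which is an arbitrary choice.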

In document Programming heterogeneous MPSoCs : tool flows to close the software productivity gap (Page 125-128)