Deriving Architecture and Mapping Specifications

So far, we have shown how to derive the parameters of the taskset corresponding to a given CSDF graph. Recall from Figure 1.7 on page 14 that the scheduling framework produces two outputs that are used as inputs to the system-level synthesis step. These outputs are the architecture and mapping specifications. The architecture specification refers to the number of processors needed to schedule the taskset, while the mapping specification refers to the allocation of tasks to processors. Suppose that we want to run a set of programs modeled by a set of graphsG= {G1,G2,⋯,G_N}on a platform.

LetT(G_i)denote the taskset corresponding to the actors ofG_i. Then, we defineTto

be asupertaskset given by

T= ⋃ Gi∈G

T(G_i) (4.52)

After that, we apply one of the partitioning schemes described in Section 2.4.4 onT. This results in am-partition ofT, denoted bymT. Then, the architecture specification states that the system consists ofmprocessors, while the mapping specification states that the allocation of tasks to processors is given bymT. To illustrate these steps, we give the following example.

4.8. Deriving Architecture and Mapping Specifications 71 acyclic CSDF graph G matched I/O? (see pp. 51) ℛ_PS = ℛ_WSTS ℛ_PS < ℛ_WSTS balanced? (see pp. 51) ℒ_PS = ℒ_WSTS ⃗qˇ = ⃗1? ℒ_PS = ℒ_WSTS if⃗η = ⃗0 Use ⃗ηto reduceℒ ℒ_PS ≥ ℒ_WSTS Yes No Yes No Yes No

Figure 4.13:Decision tree for scheduling CSDF actors as real-time periodic tasks.ℛ_PSandℒ_PS refer to the throughput and latency, respectively, whenP⃗₌P.⃗ˇ

Example4.8.1. Consider the programs shown in Listing 1 (on page 6) and Listing 2. Applying the automated parallelization and model construction steps explained in Chapter 3 to both programs results in the CSDF graphs shown in Figures 2.1 and 4.14. We denote the graph shown in Figure 2.1 byG1and the one shown in Figure 4.14 by G2. Recall from Example 4.3.1 that the execution time vector ofG1isC⃗_{= (︀}_{5, 8, 24, 4}_⌋︀T_.

72 Chapter 4. Scheduling Framework

Listing 2Another example of a SANLP in C 1 int main() 2 { 3 for (i = 0; i < 100; i++) { 4 for (j = 0; j < 100; j++) { 5 in(&x); 6 g1(x, &y); 7 g2(y, &z); 8 out(z); 9 } 10 } 11 return 0; 12 } A1(2) A₂(4) A₃(7) A₄(1) E1 E2 E3 (︀1⌋︀ (︀1⌋︀ (︀1⌋︀ (︀1⌋︀ (︀1⌋︀ (︀1⌋︀

Figure 4.14:The CSDF graph corresponding to the SANLP program in Listing 2. A1corresponds

toin, A2corresponds tog1, A3corresponds tog2, and A4corresponds toout. The numbers

between the parentheses are the WCET of the actors. For example, the WCET of actor A2is 4.

Graph C⃗ P⃗ S⃗ ⃗b u

sum(T) G1 (︀5, 8, 24, 4⌋︀T (︀8, 12, 24, 8⌋︀T (︀0, 8, 24, 32⌋︀T (︀2, 2, 5, 3, 2⌋︀T 2.7916 G2 (︀2, 4, 7, 1⌋︀T (︀7, 7, 7, 7⌋︀T (︀0, 7, 14, 21⌋︀T (︀2, 2, 2⌋︀T 2.0 Table 4.5:The taskset parameters for G1and G2assuming µG =1and⃗η= ⃗1for both graphs.D is⃗ omitted sinceD⃗ _{= ⃗}P.

⃗

C= (︀2, 4, 7, 1⌋︀T. Assume that the period scaling factor for both graphs is 1 (see (4.17)

on page 54) and the deadline factor for both graphs is 1.0 (see (4.22) on page 58). Then, we compute the periods, start times, and buffer sizes of both graphs as shown in Table 4.5. We see from Table 4.5 that the total utilization ofT(G1)andT(G2)is equal to

4.7916. Thus, the absolute minimum number of processors needed to scheduleG1and G2on a system, assuming an optimal scheduling algorithm, is given by (2.20) on page 35 and is equal to ˇmOPT= [︂4.7916⌉︂ =5. Under partitioned scheduling, the minimum

number of processors needed to scheduleG1andG2on a system is given by (2.21) on page 36.

Suppose that EDF is used as a scheduling algorithm and FFD (see Section 2.4.4) is used as an allocation algorithm. Then, the resulting mapping is shown in Figure 4.15. In this case, the minimum number of processors needed to scheduleG1andG2is

4.8. Deriving Architecture and Mapping Specifications 73 π1 π2 π3 π4 π5 π6 Tsrc u=0.63 Tf1 u=0.67 Tf2 u=1.0 Tsnk u=0.5 Tin u=0.29 Tg1 u=0.57 Tg2 u=1.0 Tout u=0.14

Figure 4.15:Mapping of G1and G2onto 6 processors assuming EDF and FFD. πidenotes the

ith_{processor and u denotes the total utilization of a task. Each task corresponds to a function}

in either Listing 1 or 2. For example, Tg1corresponds to functiong1in Listing 2. The arrows represent the communication between the tasks.

equal to 6. Figure 4.15 contains both the architecture and mapping specifications needed for performing the system-level synthesis in Chapter 5. It states how many processors are needed (i.e., 6) and the mapping of each task to processors together with

Chapter 5 System-Level Synthesis

Build a system that even a fool can use, and only a fool will want to use it.

George Bernard Shaw

S

YSTEM-level synthesis represents the fifth step in the proposed design flow. The inputs to this step are the program, architecture, and mapping specifications. The program specification consists of the PPNs derived in Chapter 3 together with the tasks’ parameters and buffer sizes derived in Chapter 4. The architecture specification describes the number of processors as derived in Section 4.8. Finally, the mapping specification, derived in Section 4.8, associates each task with the processor on which it runs. All these specifications are used, as shown in Figure 5.1, to generate the MPSoC platform which consists of the hardware part together with the software running on that hardware. The whole system-level synthesis procedure is performed using ESPAM [NSD08]. ESPAM is a system-level synthesis tool for streaming systems with support for model-based design. We extended ESPAM in order to: (1) generate the hardware architecture explained in the following section, (2) generate the software with the proper scheduling and communication infrastructures explained later in Section 5.2, and (3) add support for Xilinx ML605 and Zynq boards that are used later in Chapter 6 for prototyping the synthesized systems.

5.1 Hardware

In this dissertation, we consider a hardware platform consisting of a tiled distributed memory MPSoC as shown in Figure 5.2. The on-chip interconnect is assumed to be apredictableon-chip interconnect. A predictable on-chip interconnect is one that

76 Chapter 5. System-Level Synthesis Program Specifications Architecture Specifications Mapping Specifications System-Level Synthesis (ESPAM) RTL + Software 5 ○ 5 ○ 5 ○

Figure 5.1:Electronic System-Level Synthesis

Tile Tile Tile

On-Chip Interconnect

Shared Memories Peripherals

...

Figure 5.2:Top-level block diagram of the hardware platform considered in this dissertation

provides bounded worst-case latency on read/write operations between any communi- cating source and destination pair in the SoC. An example of such interconnect is the Æthereal network-on-chip [GDR05]. The aforementioned assumption is necessary in order to compute in Section 4.3 a safe upper bound on the worst-case execution time of each actor.

Each tile consists, as shown in Figure 5.3, of a processor, several memories, and a timer. Each tile contains three dedicated memories:

• Program Memory (PM) to store the programs’ binaries

• Data Memory (DM) to store the data segments, heap and stack

• Communication Memory (CM) to store the data sent to other processors Each processor writes the processed data to its local communication memory. After that, remote consumer processors read this data from the communication memory of the producer processor. This means that data writes are always local, and data reads are either local or remote depending on the actual mapping of tasks to processors. All the memories are implemented as dual-port memories which means that the commu-

In document On hard real-time scheduling of cyclo-static dataflow and its application in system-level design (Page 85-92)