Case-study: Mapping 2D-IDCT Specification to IP-primitives


4.5.2 Case-study: Mapping 2D-IDCT Specification to IP-primitives

“Create the flexible one-on-one mappings of a particular application Kahn PN onto: (1) multiprocessor with compile-time pipelining, and (2) multiprocessor with run-time pipelining of symbolic instructions⁹, where: (1) the application process network is pre-created and cannot be changed, and (2) the granularity of computations and communications between the application model and the chosen architecture models differ significantly.”

The restriction on changing application process networks implies that mapping transformations can happen only at the level of symbolic programs. The restriction on granularity enforces the Refining Step (see Section 4.4.3). Thus, in this sub-section we give (1) a description of the mapping case we conducted and (2) a refinement of the 2D-IDCT mapping model that does not modify the high-level specification. It is worth noting that in this case we model ‘fictive’ architectures, so we cannot reason about the accuracy of the results as we did for the case in Section 4.5.1. Here, we can reason only about the performance impact of refinement and transformation choices on the simulated system in isolation, i.e., within the scope of the simulation models, which are already available in [75].

The Two-Dimensional Inverse Discrete Cosine Transform (2D-IDCT) is part of several image compression methods, one of which is the standard described in [81]. The 2D-IDCT appears in many multimedia applications and is a critical-path function [50].

At some level of abstraction, the 2D-IDCT application is specified as a 3-process PN [18], as shown in Figure 4.5 (the processes Source and Sink do not play any role here; they are shown only to deliver input data and collect output data).

[Figure 4.5 diagram: Source → FIFO1 → IDCT1d (rows) → FIFO2 → Transpose → FIFO3 → IDCT1d (cols) → FIFO4 → Sink]

Figure 4.5: The 2D-IDCT Kahn Process Network
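The process chain of Figure 4.5 can be mimicked with threads and unbounded queues. The following is only a minimal sketch of Kahn-style semantics (blocking reads, non-blocking writes on unbounded FIFOs); the names `END`, `process`, and `run_network` are hypothetical and are not taken from the modeling framework discussed here, and the stage functions are left as parameters.

```python
import queue
import threading

END = object()  # hypothetical end-of-stream marker, not part of the original model


def process(fn, fifo_in, fifo_out):
    """One Kahn process: blocking read, apply fn, write the result."""
    while True:
        token = fifo_in.get()      # blocks until a token is available
        if token is END:
            fifo_out.put(END)      # forward the marker and terminate
            return
        fifo_out.put(fn(token))    # never blocks: the queue is unbounded


def run_network(blocks, stage_fns):
    """Chain the stages with FIFOs; Source and Sink are played by the caller."""
    fifos = [queue.Queue() for _ in range(len(stage_fns) + 1)]
    workers = [
        threading.Thread(target=process, args=(fn, fifos[i], fifos[i + 1]))
        for i, fn in enumerate(stage_fns)
    ]
    for worker in workers:
        worker.start()
    for block in blocks:           # Source: deliver the input tokens
        fifos[0].put(block)
    fifos[0].put(END)
    results = []
    while True:                    # Sink: collect the output tokens
        token = fifos[-1].get()
        if token is END:
            break
        results.append(token)
    for worker in workers:
        worker.join()
    return results
```

With three stage functions this yields four FIFOs, matching the figure. Because every process only performs blocking reads, the output does not depend on how the threads are interleaved.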

In this graph, 1D-IDCT is the One-Dimensional Inverse Discrete Cosine Transform, which transforms a frequency-domain block of 8×8 coefficients towards the spatial (pixel) domain, producing a block of 8×8 values in a row-by-row fashion. Transpose transposes the output blocks of the first 1D-IDCT and then delivers the transposed blocks to the second 1D-IDCT. The second 1D-IDCT also applies row-by-row transformations which, due to the transposition, correspond to a column-by-column transformation on the output of the first 1D-IDCT. The (unbounded) channels between the two producer-consumer pairs carry these blocks.
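The row/transpose/row decomposition described above can be written down directly. The sketch below assumes the orthonormal DCT-III as the 1D inverse transform (the text does not fix a normalization), and all function names are illustrative. Note that a chain with a single Transpose leaves the result transposed with respect to the textbook 2D-IDCT.

```python
import math


def idct_1d(coeffs):
    """Orthonormal 1D inverse DCT (DCT-III) of one row."""
    n = len(coeffs)
    out = []
    for x in range(n):
        acc = 0.0
        for u, c in enumerate(coeffs):
            scale = math.sqrt(1.0 / n) if u == 0 else math.sqrt(2.0 / n)
            acc += scale * c * math.cos((2 * x + 1) * u * math.pi / (2 * n))
        out.append(acc)
    return out


def idct_rows(block):
    """Apply the 1D transform to every row of the block."""
    return [idct_1d(row) for row in block]


def transpose(block):
    """Swap rows and columns of a square block."""
    return [list(col) for col in zip(*block)]


def idct_2d(block):
    """Rows, transpose, rows: the second pass acts on the columns
    of the first pass. The output is left transposed."""
    return idct_rows(transpose(idct_rows(block)))
```

A block that holds only a DC coefficient of 8 is mapped to a constant block of ones, which is a quick sanity check of the normalization assumed here.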

Symbolic Program Representation of the 2D-IDCT Kahn PN

The listing in Figure 4.6 below is a Symbolic Program (SP) that represents all three processes in Figure 4.5.

    main {
      loop condition 0 (iN)
      {
        read m (i0, 64);
        execute f (i0 o0, b);
        write n (o0, 64);
      }
    }

Figure 4.6: Symbolic program template for both the IDCT1D and the Transpose tasks. The numbers “64” and “b” indicate, respectively, the token size in number of pixels read from (written to) the input (output) channel, and the execution budget of the function being executed by the process, relative to the read and write execution budgets.

The 1D-IDCT and the Transpose functions have different budgets¹⁰. As indicated earlier, each SP is associated with an accompanying control trace. From the structure of the SP in Figure 4.6, it can be seen that the corresponding control trace is trivial in this case because there is only one control point.
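To make the role of the single control point concrete, the template of Figure 4.6 can be expanded into a flat sequence of symbolic instructions once the loop count is known. This is a minimal sketch: the `SymbolicInstruction` class and `expand_template` function are hypothetical names introduced for illustration, and the control trace is reduced to one integer (the number of iterations of loop condition 0).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SymbolicInstruction:
    kind: str     # "read", "execute", or "write"
    target: str   # channel name (m, n) or function name (f)
    amount: int   # token count for read/write, budget for execute


def expand_template(iterations, budget):
    """Unfold the Figure 4.6 loop body `iterations` times.

    `iterations` stands in for the control trace of loop condition 0;
    `budget` is the parameter b of the execute instruction.
    """
    body = [
        SymbolicInstruction("read", "m", 64),
        SymbolicInstruction("execute", "f", budget),
        SymbolicInstruction("write", "n", 64),
    ]
    return [instr for _ in range(iterations) for instr in body]
```

For example, `expand_template(3, 8)` produces nine instructions that read and write three 64-pixel blocks in total.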

Architecture Specifications & Mapping Descriptions

We conducted two different experiments: (1) mapping of the 2D-IDCT specification onto a multiprocessor with compile-time pipelining processors, and (2) mapping of the 2D-IDCT specification onto a multiprocessor with run-time pipelining processors. In these experiments, we use the following architecture plus mapping specifications (for each mapping there is one architecture plus mapping specification):

1. Binding: We assumed a 1-on-1 mapping of SPs onto processors and of application channels onto FIFO components.

2. Binding: There is no resource sharing, neither for computation (an operating system is not needed) nor for communication (a bus is not needed since all buffers are dedicated).

3. Refining: 1D-IDCT processors operate on rows (8-pixel data-token) rather than on blocks (8×8 = 64-pixel data-token). That is, the architecture check-data and check-room FIFO synchronization primitives operate on rows, and the signal-room and signal-data FIFO synchronization primitives operate on blocks. Conversely, in the other tasks (Source, Transpose, Sink) the check-data and check-room synchronization primitives operate on blocks, and the signal-room and signal-data synchronization primitives operate on rows.

4. Refining: 1D-IDCT implementations in the processors are as in [82], and can be represented by a sequence of four different execute symbolic instructions.

5. Matching: FIFO buffers in the architecture are sized such that they provide enough space (in this case study, for the three mappings the FIFO buffer sizes are always 256 tokens).

¹⁰ The execution budget parameter of the 1D-IDCT tasks is 8, and the execution budget parameter of the Transpose [...]
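The mixed row/block granularity of the synchronization primitives in item 3 can be illustrated with a counter-only FIFO. This is a sketch under assumptions, not the architecture model itself: `TokenFifo`, `source_sends_block`, and `IdctReader` are hypothetical names, the primitives are non-blocking predicates rather than blocking calls, and only token counts are tracked. It shows a task that checks on blocks and signals on rows writing into a FIFO, and a task with the 1D-IDCT pattern (check on rows, signal on blocks) reading from it.

```python
ROW, BLOCK = 8, 64  # token granularities from item 3


class TokenFifo:
    """Counter-only FIFO: tracks token counts, not token values."""

    def __init__(self, capacity):
        self.room = capacity  # free slots visible to the producer
        self.data = 0         # filled slots visible to the consumer

    def check_room(self, n):
        return self.room >= n

    def write(self, n):
        assert self.check_room(n), "producer must check room first"
        self.room -= n

    def signal_data(self, n):
        self.data += n

    def check_data(self, n):
        return self.data >= n

    def read(self, n):
        assert self.check_data(n), "consumer must check data first"
        self.data -= n

    def signal_room(self, n):
        self.room += n


def source_sends_block(fifo):
    """Check room per block, signal data per row."""
    if not fifo.check_room(BLOCK):
        return False
    for _ in range(BLOCK // ROW):
        fifo.write(ROW)
        fifo.signal_data(ROW)
    return True


class IdctReader:
    """Check data per row, signal room per block."""

    def __init__(self, fifo):
        self.fifo = fifo
        self.rows_read = 0

    def try_read_row(self):
        if not self.fifo.check_data(ROW):
            return False
        self.fifo.read(ROW)
        self.rows_read += 1
        if self.rows_read == BLOCK // ROW:  # a whole block has been consumed
            self.fifo.signal_room(BLOCK)
            self.rows_read = 0
        return True
```

With a 256-token FIFO, the producer can place four blocks before its block-level room check fails, and room is released only after the reader has consumed the eighth row of a block.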

Compile-time Transformation of Symbolic Programs

[Figure 4.7 diagram: loop iterations plotted against time. Each iteration runs the sequence R m, E 1, E 2, E 3, E 4, W n, and successive iterations start one stage later so that they overlap.]

Figure 4.7: The transformed loop body of the IDCT1D SP

The idea of this transformation is to schedule the execution of the SP shown in Figure 4.6 in the manner shown in Figure 4.7. Applying the transformation changes the SP template; the resulting template is shown in Figure 4.8.

The transformation illustrated in Figures 4.7 and 4.8 is so-called “software pipelining”, which allows symbolic instructions to overlap at run-time [56]. Each symbolic instruction, delimited by the “;” terminal, may express parallelism (mutual independence) among symbolic operations delimited by the “||” terminal. Because this parallelism is made explicit in the SP, no dependency checks (argument checking) are performed in the architecture model at run-time. Notice that mutually independent symbolic operations in an explicitly parallel symbolic instruction need not have equal evaluation times. The next symbolic instruction is scheduled only when the slowest symbolic operation in the current symbolic instruction terminates.
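The effect of this scheduling rule can be estimated with a few lines of code. The sketch below generates an explicitly parallel schedule with a prologue, a steady state, and an epilogue for 8 rows and 6 stages, and charges each symbolic instruction the cost of its slowest operation. The function names are hypothetical, and so are the read and write costs of 1; only the budget of 2 per execute stage follows the refined SP.

```python
def pipelined_schedule(stages, rows):
    """Return one list of (stage, row) operations per symbolic instruction.

    At step t, stage s works on row t - s, provided that row exists.
    """
    schedule = []
    for step in range(rows + len(stages) - 1):
        ops = [
            (stage, step - s)
            for s, stage in enumerate(stages)
            if 0 <= step - s < rows
        ]
        schedule.append(ops)
    return schedule


def schedule_time(schedule, cost):
    """Each instruction lasts as long as its slowest operation."""
    return sum(max(cost[stage] for stage, _ in ops) for ops in schedule)


STAGES = ["read", "f1", "f2", "f3", "f4", "write"]
# Assumed costs: read and write are taken as the unit, executes cost 2.
COST = {"read": 1, "f1": 2, "f2": 2, "f3": 2, "f4": 2, "write": 1}

schedule = pipelined_schedule(STAGES, rows=8)
pipelined = schedule_time(schedule, COST)
sequential = 8 * sum(COST.values())
```

Under these assumed costs the schedule has 13 symbolic instructions and takes 24 time units per block, against 80 time units when the six operations of every row run strictly one after another.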

Run-time Transformation of Symbolic Programs

Another way of modeling the behavior shown in Figure 4.7 is to detect at run-time a possible overlapping of read, execute, and write symbolic instructions. The appropriate processor model is provided in Chapter 3, Section 3.5.2. The transformation applied here is a simple refinement: an expansion of the loop body in the SP template shown in Figure 4.6. The processor model, on the other hand, is now more complex because it has to produce the pipelined execution order at run-time (in the compile-time case, more of this work is done by the compiler). The result of the refinement of the SP is shown in Figure 4.9. It is worth noting that the execution flow of this SP is the same as in Figure 4.7.

    main {
      loop condition 0 iN
      {
        read m (i0, 8);

        read m (i0, 8) || execute f1 (i0 I1, 2);

        read m (i0, 8) || execute f1 (i0 I1, 2) || execute f2 (i1 I2, 2);

        read m (i0, 8) || execute f1 (i0 I1, 2) || execute f2 (i1 I2, 2) ||
        execute f3 (i2 I3, 2);

        read m (i0, 8) || execute f1 (i0 I1, 2) || execute f2 (i1 I2, 2) ||
        execute f3 (i2 I3, 2) || execute f4 (i3 o0, 2);

        read m (i0, 8) || execute f1 (i0 I1, 2) || execute f2 (i1 I2, 2) ||
        execute f3 (i2 I3, 2) || execute f4 (i3 o0, 2) || write fn (o0, 8);

        read m (i0, 8) || execute f1 (i0 I1, 2) || execute f2 (i1 I2, 2) ||
        execute f3 (i2 I3, 2) || execute f4 (i3 o0, 2) || write fn (o0, 8);

        read m (i0, 8) || execute f1 (i0 I1, 2) || execute f2 (i1 I2, 2) ||
        execute f3 (i2 I3, 2) || execute f4 (i3 o0, 2) || write fn (o0, 8);

        execute f1 (i0 I1, 2) || execute f2 (i1 I2, 2) || execute f3 (i2 I3, 2) ||
        execute f4 (i3 o0, 2) || write fn (o0, 8);

        execute f2 (i1 I2, 2) || execute f3 (i2 I3, 2) || execute f4 (i3 o0, 2) ||
        write fn (o0, 8);

        execute f3 (i2 I3, 2) || execute f4 (i3 o0, 2) || write fn (o0, 8);

        execute f4 (i3 o0, 2) || write fn (o0, 8);

        write fn (o0, 8);
      }
    }

    main {
      loop condition 0 (iN)          // the original loop
      {
        loop condition 1 (iM)        // the newly introduced loop
        {
          read m (i0, 8);            // the refined read (lines)

          execute f1 (i0 I1, 2);     // the first pipeline stage
          execute f2 (i1 I2, 2);     // the second pipeline stage
          execute f3 (i2 I3, 2);     // the third pipeline stage
          execute f4 (i3 o0, 2);     // the fourth pipeline stage

          write n (o0, 8);           // the refined write (lines)
        }
      }
    }

Figure 4.9: Unrolled symbolic program template.
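How a processor model might recover the overlap of Figure 4.7 from the sequential SP of Figure 4.9 can be sketched with a simple recurrence. This is not the processor model of Section 3.5.2; it is a simplified illustration that assumes one functional unit per stage, in-order issue, separate storage for the intermediate values of each row, and the same hypothetical costs of 1 for read and write. An operation starts once its operand is ready (the previous stage of the same row has finished) and its unit is free (the same stage of the previous row has finished).

```python
def runtime_pipeline_finish_times(stages, cost, rows):
    """Return finish[r][s]: completion time of stage s on row r."""
    finish = [[0] * len(stages) for _ in range(rows)]
    for r in range(rows):
        for s, stage in enumerate(stages):
            # Data dependency: the previous stage of the same row.
            operand_ready = finish[r][s - 1] if s > 0 else 0
            # Resource dependency: the same stage of the previous row.
            unit_free = finish[r - 1][s] if r > 0 else 0
            finish[r][s] = max(operand_ready, unit_free) + cost[stage]
    return finish


STAGES = ["read", "f1", "f2", "f3", "f4", "write"]
# Assumed costs: read and write are taken as the unit, executes cost 2.
COST = {"read": 1, "f1": 2, "f2": 2, "f3": 2, "f4": 2, "write": 1}

finish = runtime_pipeline_finish_times(STAGES, COST, rows=8)
makespan = finish[-1][-1]
```

Under these assumptions the last write of a block finishes at time 24, and the first row completes at time 10. The dependency checks that the compile-time variant avoids are exactly the two lookups inside the inner loop.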

In document Execution platform modeling for system-level architecture performance analysis (Page 103-107)