Results - Dynamic task management in MPSoC platforms

In this section, hardware results based on gate-level synthesis of the OSIP core are reported and compared with those of LT-OSIP. A preliminary performance analysis is also carried out, considering the execution time of the most frequently used and most critical OSIP commands. In particular, the area and energy efficiency of OSIP and LT-OSIP are compared using two widely used cost metrics: the Area-Time product (AT) and the Area-Time-Energy product (ATE).

3.3.1 Area, Timing and Area-Time Product (AT)

The OSIP core is synthesized with Synopsys Design Compiler [187] using a 65 nm standard cell library under typical operating conditions (supply voltage 1.0 V, temperature 25 °C). The area and the maximum achievable clock frequency are given in Table 3.1. For comparison, the synthesis results of LT-OSIP are also given in the table. The area of the memories (program and data memory) is excluded from the reported area, because memory libraries for this technology were not available during this work.

Table 3.1: Synthesis results of OSIP and LT-OSIP

                              OSIP    LT-OSIP
Total area (kGE)              35.7    23.1
Combinational (kGE)           25.2    14.8
Sequential (kGE)              10.5    8.3
Max. clock frequency (MHz)    690     690

As shown in the table, OSIP and LT-OSIP achieve the same maximum clock frequency. In both processors, the critical path is the read access to the data memory in the MEM stage, because the output delay of the data memory is considerable when a large static array is allocated for the OSIP_DTs.

Although the maximum clock frequency is the same, OSIP has a much larger area than LT-OSIP. The area increases by 54%, caused by the special hardware features in the OSIP core. However, even with an area of 35.7 kGE, the OSIP core is still rather small in the context of a multi-processor system.

The special hardware features in OSIP lead to a significant performance improvement. For an initial performance analysis, the execution time of the commands in OSIP is compared with that in LT-OSIP. The current systems have dozens of commands supporting the OSIP APIs. However, not all commands are critical for system performance. The commands for configuring the system, for example, are executed only once at the beginning of an application and are therefore irrelevant for the overall system performance. The commands for getting and setting system information, which are simple and occur only occasionally, are typically not critical either.

The most critical commands are task-related and frequently issued, and/or often involve task scheduling and mapping or synchronization. By profiling the applications, five “hot-spot” commands are identified: creating a ready task (CRT), creating a dependent task (CDT), synchronizing tasks (ST), requesting a task (RT) and fetching a task (FT). Their occurrence frequencies (f_cmd) are given in Figure 3.14 and sum to 99.9%. The figure also shows the average execution cycles of these commands when running an H.264 decoder application. As OSIP and LT-OSIP have the same clock frequency, the ratios of the execution cycles are also the speed-up factors achieved by OSIP for these commands.

The execution time of the commands certainly varies from one application to another, and also between different phases of one application. It largely depends on the size of the task lists and on the scheduling and mapping algorithms. The numbers presented in the figure result from a complex scheduling and mapping algorithm for a moderate-sized system with 7 ARM processors, and the system and task configuration is quite generic. In practice, the configurations and the scheduling and mapping algorithms can be specifically simplified or optimized for different target applications. However, this figure is meant to provide a first impression of how much the special instructions in OSIP can improve the command execution time. For these “hot-spot” commands, a speed-up of up to 9.1× is achieved. Weighting the commands by their occurrence frequency, the average speed-up is 7.7×, following Equation 3.1. The relatively low speed-up for the commands CDT and FT is due to the fact that they do not cause scheduling or mapping during execution.

Command    LT-OSIP (cycles)    OSIP (cycles)    f_cmd
CRT        3,791               498              19.5%
CDT        869                 305              7.3%
ST         1,912               252              20.3%
RT         7,197               792              26.4%
FT         337                 135              26.4%

Figure 3.14: Execution cycles of “hot-spot” commands and their occurrence frequency in percentage

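\[
\frac{t_{\text{LT-OSIP}}}{t_{\text{OSIP}}} = \frac{\sum_{cmd} \left( \mathit{Cycles}_{cmd,\text{LT-OSIP}} \cdot f_{cmd} \right)}{\sum_{cmd} \left( \mathit{Cycles}_{cmd,\text{OSIP}} \cdot f_{cmd} \right)} = 7.7 \tag{3.1}
\]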
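The frequency-weighted average of Equation 3.1 can be reproduced with a short script. This is only a sketch: it uses the cycle counts and occurrence frequencies read from Figure 3.14, and the variable and function names are illustrative rather than part of the OSIP tooling.

```python
# Average execution cycles per command, read from Figure 3.14
# (H.264 decoder, 7 ARM processors).
cycles_lt_osip = {"CRT": 3791, "CDT": 869, "ST": 1912, "RT": 7197, "FT": 337}
cycles_osip = {"CRT": 498, "CDT": 305, "ST": 252, "RT": 792, "FT": 135}

# Occurrence frequency f_cmd of each command (sums to 99.9%).
f_cmd = {"CRT": 0.195, "CDT": 0.073, "ST": 0.203, "RT": 0.264, "FT": 0.264}


def weighted_cycles(cycles, freq):
    """Sum of Cycles_cmd * f_cmd over all commands (one side of Equation 3.1)."""
    return sum(cycles[cmd] * freq[cmd] for cmd in freq)


# Per-command speed-up: ratio of cycles, valid because both cores run at 690 MHz.
per_command_speedup = {cmd: cycles_lt_osip[cmd] / cycles_osip[cmd] for cmd in f_cmd}

# Frequency-weighted average speed-up (Equation 3.1).
average_speedup = weighted_cycles(cycles_lt_osip, f_cmd) / weighted_cycles(cycles_osip, f_cmd)

for cmd, speedup in per_command_speedup.items():
    print(f"{cmd}: {speedup:.1f}x")
print(f"weighted average: {average_speedup:.1f}x")
```

Note that the weighted average is dominated by RT, which is both the most expensive command on LT-OSIP and one of the two most frequent ones.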

Taking the area overhead of OSIP into consideration, the Area-Time product (AT), i.e., the area efficiency, is improved by a factor of 5.0 by OSIP, as calculated in Equation 3.2.

\[
\frac{AT_{\text{LT-OSIP}}}{AT_{\text{OSIP}}} = \frac{A_{\text{LT-OSIP}} \cdot t_{\text{LT-OSIP}}}{A_{\text{OSIP}} \cdot t_{\text{OSIP}}} = 5.0 \tag{3.2}
\]
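As a quick numerical check of Equation 3.2, the AT ratio can be computed from the areas in Table 3.1 and the rounded speed-up of Equation 3.1. The names below are illustrative, and the result matches the reported value only up to rounding.

```python
# Total core area in kGE from Table 3.1 (memories excluded).
area_kge = {"OSIP": 35.7, "LT-OSIP": 23.1}

# t_LT-OSIP / t_OSIP, the rounded result of Equation 3.1.
speedup = 7.7

# Relative area overhead of OSIP over LT-OSIP.
area_overhead = area_kge["OSIP"] / area_kge["LT-OSIP"] - 1.0

# AT_LT-OSIP / AT_OSIP = (A_LT-OSIP / A_OSIP) * (t_LT-OSIP / t_OSIP)
at_improvement = (area_kge["LT-OSIP"] / area_kge["OSIP"]) * speedup

print(f"area overhead: {area_overhead:.1%}")
print(f"AT improvement: {at_improvement:.1f}x")
```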

The performance analysis in this section is rather preliminary, as the OSIP efficiency is analyzed only in isolation. However, the performance of a system depends not only on the efficiency of the task manager, even though it is certainly an important factor, but also on many other factors, such as task sizes and the communication architecture. The bottleneck component ultimately determines the system performance; it could be the task manager, the communication architecture or the PEs. A systematic system-level analysis is needed for a more comprehensive evaluation of the OSIP performance, which will be presented in the next chapter.

3.3.2 Power, Energy and Area-Time-Energy Product (ATE)

The power consumption of the OSIP core is estimated by post-synthesis gate-level power simulation with Synopsys PrimeTime [188], running an H.264 video decoder at a clock frequency of 690 MHz and a supply voltage of 1.0 V. The power of the memories is not considered.

As described in Section 3.2.3.3, the OSIP core has two states: busy and idle. Table 3.2 lists the average power consumption of OSIP in both states. In the idle state, the OSIP core consumes about 7.2% of the power of the busy state. While the static power stays almost unchanged, since it is only influenced by the area, the dynamic power is reduced by a factor of 17.3× from the busy state to the idle state.

Table 3.2: Power consumption of OSIP and LT-OSIP

                          OSIP     LT-OSIP
Busy (mW)                 17.20    8.54
  Dynamic Power (mW)      16.92    8.38
  Static Power (mW)       0.28     0.16
Idle (mW)                 1.25     0.95
  Dynamic Power (mW)      0.98     0.80
  Static Power (mW)       0.27     0.15
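The ratios quoted for the OSIP core can be recomputed from the entries of Table 3.2. The following sketch uses an illustrative dictionary layout; because the table values are rounded, the computed ratios agree with the quoted ones only approximately.

```python
# Average power in mW from Table 3.2, split into dynamic and static parts.
power_mw = {
    "OSIP": {
        "busy": {"dynamic": 16.92, "static": 0.28},
        "idle": {"dynamic": 0.98, "static": 0.27},
    },
    "LT-OSIP": {
        "busy": {"dynamic": 8.38, "static": 0.16},
        "idle": {"dynamic": 0.80, "static": 0.15},
    },
}


def total(parts):
    """Total power of one state: dynamic plus static."""
    return parts["dynamic"] + parts["static"]


osip = power_mw["OSIP"]

# Share of the busy-state power that OSIP still consumes when idle.
idle_share = total(osip["idle"]) / total(osip["busy"])

# Reduction factor of the dynamic power from busy to idle.
dynamic_reduction = osip["busy"]["dynamic"] / osip["idle"]["dynamic"]

print(f"idle / busy power: {idle_share:.1%}")
print(f"dynamic power reduction: {dynamic_reduction:.1f}x")
```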

A detailed analysis of the power consumption of the OSIP core is depicted in Figure 3.15. In both OSIP states, the main contributor to the power consumption is the clock tree, including the clock gating elements and the clock pins driving the registers (footnote 3). In the busy state, the contribution of the registers and the combinational logic is also considerably larger than in the idle state. In the idle state, the contribution of the registers and the combinational logic is due only to static power, as the complete pipeline is deactivated, i.e., no data switching takes place in the pipeline.

[Figure 3.15, legend: Registers, Combinational logic, Clock gating, Clock pin of registers. Values in a) Busy state: 1.24 mW, 8.34 mW, 2.32 mW, 5.29 mW. Values in b) Idle state: 0.21 mW, 0.76 mW, 0.06 mW, 0.21 mW.]

Figure 3.15: Power profile of OSIP

Footnote 3: In a gate-level synthesis, the clock tree buffers are typically not generated. Therefore, Figure 3.15 does [...]

Table 3.2 also presents the average power consumption of LT-OSIP. LT-OSIP consumes less power than OSIP, both in the busy and in the idle state, which is natural. In the busy state, OSIP has to finish the same amount of work as LT-OSIP, but within a shorter time. In the idle state, OSIP has a higher power consumption mainly due to its larger area. The higher dynamic power of OSIP in this state is caused by a larger clock tree, which in turn results from the larger number of registers.

It is, however, more important to compare the energy efficiency, in this case, the average energy consumption per task scheduling and mapping. The ratio of the energy efficiency between OSIP and LT-OSIP can be calculated by Equation 3.3, in which the number of tasks (#Tasks) is the same for a given application, independent of which task manager is used.

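\[
\frac{E_{task,\text{LT-OSIP}}}{E_{task,\text{OSIP}}} = \frac{\left( P_{busy,\text{LT-OSIP}} \cdot t_{\text{LT-OSIP}} \right) / \#Tasks}{\left( P_{busy,\text{OSIP}} \cdot t_{\text{OSIP}} \right) / \#Tasks} = 3.8 \tag{3.3}
\]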
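Because #Tasks cancels, Equation 3.3 reduces to the busy-power ratio multiplied by the speed-up. The sketch below evaluates it with the busy power from Table 3.2 and the rounded speed-up of Equation 3.1; the names are illustrative, and the result matches the reported factor only up to rounding.

```python
# Average busy-state power in mW from Table 3.2.
p_busy_mw = {"OSIP": 17.20, "LT-OSIP": 8.54}

# t_LT-OSIP / t_OSIP, the rounded result of Equation 3.1.
speedup = 7.7

# E_task,LT-OSIP / E_task,OSIP; the number of tasks cancels out.
energy_ratio = (p_busy_mw["LT-OSIP"] / p_busy_mw["OSIP"]) * speedup

# Fraction of the LT-OSIP energy that OSIP needs per task.
osip_energy_share = 1.0 / energy_ratio

print(f"energy per task, LT-OSIP / OSIP: {energy_ratio:.1f}x")
print(f"OSIP energy relative to LT-OSIP: {osip_energy_share:.1%}")
```

Although OSIP draws roughly twice the busy power of LT-OSIP, it is busy for a much shorter time, which is why the energy per task still drops.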

Together with Equation 3.1, this shows that OSIP consumes only 26.3% of the energy of LT-OSIP to handle a task, while improving the task management performance by a factor of 7.7×. Further considering the OSIP area overhead, the Area-Time-Energy product (ATE) per task of both task managers is compared in Equation 3.4.

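\[
\frac{ATE_{task,\text{LT-OSIP}}}{ATE_{task,\text{OSIP}}} = \frac{A_{\text{LT-OSIP}} \cdot t_{\text{LT-OSIP}} \cdot E_{task,\text{LT-OSIP}}}{A_{\text{OSIP}} \cdot t_{\text{OSIP}} \cdot E_{task,\text{OSIP}}} = 18.9 \tag{3.4}
\]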
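Since the ATE product is the AT product multiplied by the energy per task, the ratio in Equation 3.4 factors into the results of Equations 3.2 and 3.3. As a rough cross-check with the rounded values reported there:

\[
\frac{ATE_{task,\text{LT-OSIP}}}{ATE_{task,\text{OSIP}}} = \frac{AT_{\text{LT-OSIP}}}{AT_{\text{OSIP}}} \cdot \frac{E_{task,\text{LT-OSIP}}}{E_{task,\text{OSIP}}} \approx 5.0 \cdot 3.8 = 19.0
\]

This is close to the reported 18.9; the small difference presumably comes from rounding of the intermediate factors.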

The energy analysis above assumes that OSIP is busy during the entire application execution. In reality, this is not the case. For a more accurate analysis, the energy consumption in both the busy and the idle state as well as the application execution time need to be considered, which will be shown in Section 4.3 of the next chapter.
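To illustrate what such a refined analysis has to account for, the sketch below splits the application execution time into a busy and an idle part and weights each with the corresponding power from Table 3.2. The function name, the scenario (1 s execution time, LT-OSIP busy for half of it, identical execution time for both managers) and the use of the 7.7× speed-up to derive the OSIP busy time are assumptions made purely for illustration; they are not results from this work.

```python
def manager_energy_mj(p_busy_mw, p_idle_mw, busy_s, total_s):
    """Energy in mJ of a task manager that is busy for busy_s of total_s seconds."""
    if not 0.0 <= busy_s <= total_s:
        raise ValueError("busy_s must lie between 0 and total_s")
    idle_s = total_s - busy_s
    # mW * s = mJ
    return p_busy_mw * busy_s + p_idle_mw * idle_s


# (busy power, idle power) in mW from Table 3.2.
power_mw = {"OSIP": (17.20, 1.25), "LT-OSIP": (8.54, 0.95)}

# Hypothetical scenario, not taken from the text.
total_s = 1.0                  # assumed application execution time
lt_busy_s = 0.5                # assumed busy time of LT-OSIP
osip_busy_s = lt_busy_s / 7.7  # same work, scaled by the speed-up of Equation 3.1

e_lt = manager_energy_mj(*power_mw["LT-OSIP"], lt_busy_s, total_s)
e_osip = manager_energy_mj(*power_mw["OSIP"], osip_busy_s, total_s)

print(f"LT-OSIP: {e_lt:.2f} mJ, OSIP: {e_osip:.2f} mJ")
```

In this model the idle power becomes relevant as soon as the manager is lightly loaded, which is why both states and the actual execution time have to be measured at system level.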

3.4 Summary

In this chapter, an overview of OSIP-based systems is given, and the major advantages of such systems are highlighted from the efficiency and flexibility perspective.

As the key component of the system, the architecture of OSIP, an application-specific processor for OS, is described in detail. A preliminary performance analysis already shows the efficiency of OSIP in task management by comparing it with a generic RISC processor. Developing the control-centric OSIP architecture is challenging. This challenge is effectively met with special hardware features for handling list-based operations, fast memory accesses and comparing list nodes, as well as compact branch instructions. The hardware results, including area, timing and power, are presented, and the area and energy efficiency of OSIP is highlighted.
