Two applications are selected to benchmark the performance of OSIP in task management. The first is a generic synthetic application, intended to cover a wide range of analysis cases and to help investigate the limits of OSIP by varying the number of tasks, the task sizes, the number of PEs and the amount of data traffic. The second is a popular real-life application from the multimedia domain: H.264 video decoding. In the following, both applications are briefly introduced, and their mapping onto the system is described.
[Figure 4.2, upper part (task graph): Start, data producing (Taskp), task generation (Taskgen), data consuming (Taskc), End. Lower part (OSIP configuration): priority-based scheduling; task mapping onto ARM0, ARM1, ..., ARM11.]
Figure 4.2: Task graph and OSIP configuration of the synthetic application
4.2.1 Synthetic Application
The synthetic application consists of three major task types: data producing (Taskp), task generation (Taskgen) and data consuming (Taskc). The dependencies between them are shown in the upper part of Figure 4.2. First, after the program is started, Taskp writes a set of data into the shared memory and triggers the execution of Taskgen. Then Taskgen generates a set of Tasksc, which consume the data by summing them up and send the result to the I/O.
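The data flow described above can be illustrated with a minimal sequential sketch. It is not the benchmark code itself: the function names, the dictionary standing in for the shared memory and the placeholder data values are assumptions made for illustration, and the consumer tasks run one after another here instead of being dispatched by OSIP.

```python
DATA_WORDS = 500  # data set size, taken from the parameter description in this section


def task_p(shared_memory):
    """Data producing (Taskp): write the data set into the shared memory."""
    shared_memory["data"] = list(range(DATA_WORDS))  # placeholder values


def task_c(shared_memory):
    """Data consuming (Taskc): sum up the shared data and return the result."""
    return sum(shared_memory["data"])


def task_gen(shared_memory, n_task):
    """Task generation (Taskgen): create n_task consumer tasks as callables."""
    return [lambda: task_c(shared_memory) for _ in range(n_task)]


def run_synthetic(n_task):
    """Run Taskp, then Taskgen, then every generated Taskc in sequence."""
    shared_memory = {}
    task_p(shared_memory)
    consumers = task_gen(shared_memory, n_task)
    # The returned list stands in for the results sent to the I/O.
    return [consumer() for consumer in consumers]
```

Calling `run_synthetic(11)` yields one sum per generated consumer task.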
The lower part of Figure 4.2 shows how the system is configured to execute the application. The ARM processors are divided into two PE classes. The first class contains only one processor (ARM0), onto which both Taskp and Taskgen are mapped. This processor is named Producer PE (PPE). The second class contains the remaining ARM processors, which execute Tasksc. These processors are named Consumer PEs (CPEs). All CPEs have the same priority for executing a task when no task is running on them. OSIP is responsible for deciding which task is executed next and for mapping it to an available CPE. In order to push OSIP to its performance limit, a priority-based scheduling algorithm is chosen for selecting the next task. A system with n consumer PEs is further referred to as an n-CPE-system.
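The scheduling and mapping decision can be sketched as follows. This is a simplified software model under stated assumptions, not the OSIP implementation: the `Task` type, the convention that a larger value means a higher priority, and the helper names `pick_next` and `dispatch` are hypothetical. The selection scans the complete task list, which is what makes the scheduling effort grow with the list size.

```python
from dataclasses import dataclass


@dataclass
class Task:
    name: str
    priority: int  # assumption: a larger value means a more urgent task


def pick_next(task_list):
    """Scan the whole list and remove the highest-priority task from it."""
    best_index = 0
    for i, task in enumerate(task_list):
        if task.priority > task_list[best_index].priority:
            best_index = i
    return task_list.pop(best_index)


def dispatch(task_list, idle_cpes):
    """Map pending tasks to idle CPEs until tasks or CPEs run out."""
    assignments = []
    while task_list and idle_cpes:
        cpe = idle_cpes.pop(0)  # all CPEs are equal, so any idle one will do
        assignments.append((cpe, pick_next(task_list)))
    return assignments
```

With three pending tasks and two idle CPEs, the two highest-priority tasks are assigned and the third remains in the list until a CPE becomes available.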
The following parameters of the synthetic application are configurable to create different workloads for OSIP.
• N_CPE ∈ {1, 3, 5, 7, 9, 11}: This parameter configures the number of CPEs. Increasing N_CPE potentially causes task execution requests to reach OSIP more frequently and hence increases the OSIP workload.
• N_TASK ∈ {11, 88, 165}: This parameter configures the number of generated Tasksc. It is chosen as a multiple of 11, so that the tasks can be distributed evenly among the CPEs in the largest target system of this analysis, namely the 11-CPE-system. As mentioned above, the task scheduling is priority-based. To find the best candidate task for execution, OSIP has to traverse the complete task list. This means that the workload of OSIP increases linearly with the list size. When N_TASK is increased, the size of the task list potentially increases, which implies a higher workload for OSIP.
• N_ACCESS ∈ {1, 6, 11}: This parameter configures how many times a Taskc accesses the same data in the shared memory. The size of the data set produced by Taskp is fixed to 500 32-bit words, which are stored in the memory as an array. Without considering the communication overhead, the task sizes for an N_ACCESS of 1, 6 and 11 correspond to 2.5 kcycle, 15 kcycle and 27.5 kcycle, respectively. The tasks of these sizes are further referred to as small tasks, medium tasks and large tasks.
These configuration parameters also have a large impact on the system communication, which will be discussed later in detail.
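The task sizes quoted for N_ACCESS can be reproduced with a short calculation. The per-word cost of 5 cycles is not stated in the text; it is an assumption derived from the quoted 2.5 kcycle for a single pass over 500 words. The helper `tasks_per_cpe` is likewise hypothetical and only illustrates why multiples of 11 allow an even distribution in the 11-CPE-system.

```python
DATA_WORDS = 500      # size of the data set produced by Taskp
CYCLES_PER_WORD = 5   # assumption: 2.5 kcycle / 500 words, communication excluded


def task_size_cycles(n_access):
    """Task size in cycles when the data set is read n_access times."""
    return DATA_WORDS * CYCLES_PER_WORD * n_access


def tasks_per_cpe(n_task, n_cpe):
    """Return (minimum, maximum) tasks per CPE for an even static split."""
    base, extra = divmod(n_task, n_cpe)
    return base, base + (1 if extra else 0)


for n_access in (1, 6, 11):
    print(n_access, task_size_cycles(n_access) / 1000, "kcycle")
```

For N_CPE = 11 every listed N_TASK value splits without remainder, whereas for example 11 tasks on 3 CPEs leave one CPE with an extra task.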
4.2.1.1 OSIP Working Scenarios
Based on the configuration parameters above, three scenarios are defined from the perspective of the workload of OSIP, representing different types of applications:
• Best case scenario – Low workload: In this scenario, a low workload is imposed on OSIP by configuring the size of Taskc to the maximum (N_ACCESS = 11) and the number of Tasksc to the minimum (N_TASK = 11). In this type of application, a PE needs a long time to finish a task before it requests a new task from OSIP, so OSIP is accessed infrequently. Furthermore, OSIP needs less time to handle a request (in this case, to find the best candidate task in the list), because the task list is short. From the OSIP perspective, this scenario is therefore the best case, because OSIP is under little stress.
• Worst case scenario – High workload: This scenario is the exact opposite of the scenario above. Here the task size is set to the minimum (N_ACCESS = 1) and the number of Tasksc is set to the maximum (N_TASK = 165). In this scenario, the PEs finish each task within a short time and hence request new tasks from OSIP very frequently. In addition, the scheduling effort of OSIP becomes much higher due to the much larger task list. For OSIP, this is therefore the worst case.
• Average case scenario – Medium workload: In this scenario, both the task size and the number of tasks are set to medium (N_ACCESS = 6, N_TASK = 88). This configuration places the OSIP workload between those of the best case and worst case scenarios.
[Figure 4.3, task graph: compressed video, entropy decoding, inverse quantization (IQT), IDCT, intra prediction, deblocking filter, uncompressed video. OSIP configuration: priority-based scheduling; task mapping of entropy decoding, IQT/IDCT, intra prediction and deblocking filter onto ARM0, ARM1, ..., ARMn.]
Figure 4.3: Task graph and OSIP configuration of H.264
In addition, for all three scenarios, all values of N_CPE are iterated over during the performance analysis.
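The resulting experiment space can be enumerated as in the sketch below. The scenario labels and the dictionary layout are assumptions chosen for illustration; the parameter values are those given above.

```python
from itertools import product

N_CPE_VALUES = (1, 3, 5, 7, 9, 11)

# Parameter values per scenario as defined in the text; the keys are labels
# chosen for this sketch.
SCENARIOS = {
    "best": {"N_ACCESS": 11, "N_TASK": 11},
    "average": {"N_ACCESS": 6, "N_TASK": 88},
    "worst": {"N_ACCESS": 1, "N_TASK": 165},
}


def experiment_runs():
    """Return one configuration per (scenario, N_CPE) combination."""
    runs = []
    for (name, params), n_cpe in product(SCENARIOS.items(), N_CPE_VALUES):
        runs.append({"scenario": name, "N_CPE": n_cpe, **params})
    return runs
```

Three scenarios combined with six values of N_CPE give 18 configurations in total.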
4.2.2 H.264 Video Decoding
The software implementation of the H.264 video decoding follows a 2-D wave concept, which is similar to the one introduced in [127]. In this implementation, the video frames are built up from so-called Macroblocks (MBs), which are the basic data elements that the decoding algorithm operates on. Under the 2-D wave concept, the decoding of each MB depends on up to three neighboring MBs, which lie to its left, top and top left. A simplified task graph and the OSIP configuration for the application are given in Figure 4.3.
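The parallelism exposed by these dependencies can be illustrated with a small sketch that computes, for every MB, the earliest step at which it can be decoded. The sketch considers only the three dependencies named above and assumes that every MB takes one step; the function names and the frame dimensions in MBs are hypothetical, and the actual implementation relies on dynamic scheduling by OSIP, not on a precomputed order.

```python
def wave_steps(width, height):
    """Earliest decode step per MB given left, top and top-left dependencies."""
    step = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            deps = []
            if x > 0:
                deps.append(step[y][x - 1])      # left neighbor
            if y > 0:
                deps.append(step[y - 1][x])      # top neighbor
            if x > 0 and y > 0:
                deps.append(step[y - 1][x - 1])  # top-left neighbor
            step[y][x] = max(deps) + 1 if deps else 0
    return step


def waves(width, height):
    """Group MB coordinates (x, y) by step; MBs in one group are independent."""
    step = wave_steps(width, height)
    groups = {}
    for y in range(height):
        for x in range(width):
            groups.setdefault(step[y][x], []).append((x, y))
    return [groups[s] for s in sorted(groups)]
```

All MBs within one returned group have their dependencies satisfied by earlier groups and could therefore be processed in parallel on different PEs.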
In the task graph, the main functional blocks of the H.264 decoding are presented. First, the entropy-coded information of the compressed video frame data is decoded, and the MB data structures are built from it. The data of each MB are then re-scaled by an inverse quantization (IQT) and transformed by an inverse discrete cosine transform (IDCT). Afterwards, intra-frame prediction is performed on the IDCT output, which exploits the spatial correlation with the previously decoded neighboring MBs to predict the current MB. Finally, a deblocking filter is applied to remove the spikes at the edges of the MB.
Among the main functional blocks given in the figure, IQT, IDCT and intra-frame prediction are the task types that are computationally intensive and at the same time can be highly parallelized. In the OSIP configuration, the PEs are divided into two classes. The first class contains a single PE (ARM0), onto which the task type entropy decoding is mapped. After the entropy decoding is finished, parallel operations on the MBs are possible. The second PE class includes all PEs, and all task types except the entropy decoding are mapped onto this class. In the actual software implementation, IQT and IDCT are merged.
For the purpose of the system performance analysis, the number of PEs is configurable in the software implementation. Different numbers of PEs result in different frame rates, which are given in frames per second (fps). The frame rate is used as the criterion for evaluating the OSIP efficiency.