H.264 Application - Programming heterogeneous MPSoCs : tool flows to close the software product

4.3 Benchmarking

4.3.3 H.264 Application

Synthetic benchmarks reveal general characteristics of the OSIP solution, but they say little about how these characteristics affect real-life applications. For this reason, this section analyzes the performance of an H.264 video decoder with the different implementations of OSIP. Application performance is quantified by the average frame rate, measured in frames per second (fps). The parallel implementation of the H.264 decoder follows the 2D-wave concept [32]. Given this parallelization approach and the video format, the application has a theoretical maximum speedup of 8x, which corresponds to the maximum number of macroblocks that can be processed in parallel. The true maximum depends on the video sequence. Dynamic sequences typically feature more dependencies among macroblocks, which reduces the maximum parallelism.
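To illustrate how a wavefront scheme bounds the number of macroblocks that can run in parallel, the following minimal sketch counts macroblocks per wavefront step. It assumes the dependency pattern commonly associated with 2D-wave H.264 decoding (each macroblock waits for its left, top-left, top and top-right neighbors), so that macroblock (x, y) can run at step x + 2y. The function name and the frame dimensions are hypothetical and are not taken from the decoder used in this thesis.

```python
from collections import Counter


def wavefront_parallelism(mb_cols, mb_rows):
    """Return the largest number of macroblocks that share one wavefront step.

    Assumption (not from the thesis): macroblock (x, y) depends on its left,
    top-left, top and top-right neighbors, so it can be scheduled at step
    x + 2*y. Every macroblock in the same step is independent of the others.
    """
    steps = Counter(
        x + 2 * y for y in range(mb_rows) for x in range(mb_cols)
    )
    return max(steps.values())


if __name__ == "__main__":
    # Hypothetical frame sizes in macroblocks, chosen only for illustration.
    for cols, rows in [(16, 16), (11, 9)]:
        print(cols, rows, wavefront_parallelism(cols, rows))
```

Under this assumption the bound equals min(ceil(mb_cols / 2), mb_rows), so a hypothetical frame that is 16 macroblocks wide and at least 8 macroblocks high yields a bound of 8. Whether this matches the video format used in the experiments is not stated here.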

The benchmarking results of the H.264 application are shown in Figure 4.8. These results are in accordance with the trends observed in the synthetic benchmarks. UT-OSIP, the version that reported the lowest average overhead, is also the version that produces the best application performance. Similarly, the worst application performance is achieved with LT-OSIP, which also reported the highest average overhead in Figure 4.6. As the figure shows, ARM-OSIP and LT-OSIP saturate at around 25 and 20 fps, respectively, while OSIP achieves a maximum of 34.9 fps. To achieve a frame rate of 25 fps, 7 processors are needed in an ARM-OSIP-based platform, whereas 4 processors suffice when using OSIP. In this study, too, the abstract model closely follows the performance of OSIP.

The straight thin line in Figure 4.8 shows the maximum speedup, computed by multiplying the rate on a single processor of around 8 fps by the number of processing elements. The deviation observed from the theoretical maximum is due to the overhead of the multi-tasking APIs and the restricted parallelism of the application. The maximum speedup achieved was 4.36x with OSIP, 3.12x with ARM-OSIP and 2.5x with LT-OSIP. Note that the curves in Figure 4.8 start to saturate at around 5 to 6 processors. This saturation point possibly indicates a true maximum speedup of 5x to 6x, as opposed to the theoretical maximum of 8x.
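The speedup figures above follow directly from the quoted frame rates. The short sketch below reproduces the arithmetic; it takes the single-processor rate as exactly 8 fps and the ARM-OSIP and LT-OSIP saturation rates as exactly 25 and 20 fps, although the text gives all three only as approximate values. The function and variable names are illustrative.

```python
# Approximate values quoted in the text; treated as exact for this sketch.
SINGLE_CORE_FPS = 8.0
SATURATED_FPS = {"OSIP": 34.9, "ARM-OSIP": 25.0, "LT-OSIP": 20.0}


def ideal_fps(num_pes, single_core_fps=SINGLE_CORE_FPS):
    """Frame rate on the ideal linear-speedup line for num_pes processors."""
    return single_core_fps * num_pes


def speedup(fps, single_core_fps=SINGLE_CORE_FPS):
    """Speedup of a measured frame rate relative to a single processor."""
    return fps / single_core_fps


speedups = {name: speedup(fps) for name, fps in SATURATED_FPS.items()}

if __name__ == "__main__":
    for name, value in speedups.items():
        print(f"{name}: {value:.2f}x")
    # Ideal frame rate for 8 processors, the theoretical maximum parallelism.
    print(f"ideal at 8 PEs: {ideal_fps(8):.1f} fps")
```

With these inputs the ratios are 34.9 / 8 = 4.3625, 25 / 8 = 3.125 and 20 / 8 = 2.5, which agree with the reported 4.36x, 3.12x and 2.5x.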


4.4 Synopsis

This chapter presented the OSIP processor and its associated lightweight runtime system. It also introduced the simulation models used in the virtual platforms throughout this thesis. As the benchmarking showed, an OSIP-based MPSoC can efficiently execute applications with fine-grained tasks, which is key for some of the case studies in the upcoming chapters. However, this by no means implies that the methodologies presented in this thesis are only applicable to OSIP-based systems. In fact, MAPS also includes code generation for mainstream parallel programming APIs such as Pthreads and MPI as well as proprietary APIs, e.g., for TI OMAP processors.

The benchmarking presented in this chapter included simplifications of the target hardware platform, such as an idealized interconnect, as mentioned in Section 4.3.1. More hardware-oriented considerations, as well as a scalability analysis of OSIP, are the subject of ongoing research and beyond the scope of this thesis.

Chapter 5

Sequential Code Flow

Chapter 1 alluded to the problem of sequential code in current embedded systems, showing that methodologies and tools are needed to help migrate legacy code to new parallel platforms. As discussed in Section 3.2, state-of-the-art solutions differ from each other in the problem setup, e.g., input language, programming restrictions, parallel output and target platform characteristics. This chapter describes a solution to the sequential problem with the setup presented in Section 2.4.

The chapter is organized as follows. Section 5.1 describes the overall tool flow to obtain a parallel specification from a sequential C application. Sections 5.2–5.4 provide details about the main phases of this flow, followed by examples in Section 5.5. Thereafter, Section 5.6 lists the deficits of the current implementation. The chapter ends with a summary in Section 5.7.

5.1 Tool Flow Overview

An overview of the sequential flow is shown in Figure 5.1. The inputs to the flow are the application code itself and the architecture model. The main goal of this flow is to obtain a suitable parallel implementation that can then be transformed into a parallel program, e.g., using CPN or Pthreads. Although this chapter’s focus is on methodologies and algorithms for parallelism extraction, a semi-automatic code generator is also included in the flow.

The flow in Figure 5.1 is divided into three phases. (1) The analysis phase, in Figure 5.1a, produces a graph representation of the application that includes profiling information, i.e., the sequential application model from Definition 2.29. This phase accounts for the first step of the sequential problem in Section 2.4.2. (2) The parallelism extraction phase, in Figure 5.1b, accounts for the remaining two steps of the sequential problem statement. (3) Lastly, the backend phase, in Figure 5.1c, is in charge of exporting the parallel implementation in different formats. The tool flow uses the open source projects LLVM [150] and Clang [149]. These projects are briefly introduced in Section 5.1.1. Thereafter, Section 5.1.2 gives an overview of the tool flow components.
