
4 Evaluation of multi-core platforms with configurable accelerators

4.1 Applications development cost

This section evaluates the cost of developing applications on multi-processor systems on chip (MPSoCs), including the implementation of the related application-specific or reconfigurable accelerators.

In multi-core systems, the first step in application development is the partitioning of the target application among the computational cores of the platform, which can be either homogeneous or heterogeneous. When programming a heterogeneous MPSoC with a relatively small number of cores (i.e., up to 4), this partitioning can be performed manually and handled with commonly used programming languages, such as C or C++.

This is the case for most ASSPs described in Section 1, which use a controlling core plus a set of application-specific hardware accelerators or powerful DSPs. These programming languages have the advantage of a large programming legacy, owing to their wide use in many domains. On the other hand, they provide no native statements for synchronization or, more generally, for handling parallelism. In such devices this shortcoming is often compensated by pre-packaged libraries that the device vendors provide for standard kernels. These libraries drastically reduce the development time for end users, who only need to write the wrapping code between the application kernels. This is the programming style of Morpheus, where the ARM processor, programmed in C, provides synchronization and control of the reconfigurable engines, while the programs running on each reconfigurable engine are developed independently, packaged in a configuration bitstream and loaded at run-time by the ARM processor.
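The division of roles described above (a control core that configures, launches and synchronizes independently developed kernels) can be illustrated with a minimal sketch. It is written in Python with threads purely for illustration and is not the Morpheus tool flow: the `Engine` class, its `load`/`start` methods and the `control_core` function are hypothetical names, and a "bitstream" is modeled as a plain Python callable.

```python
import threading


class Engine:
    """Hypothetical stand-in for one reconfigurable engine."""

    def __init__(self, name):
        self.name = name
        self.kernel = None
        self.result = None
        self.done = threading.Event()

    def load(self, bitstream):
        # Assumption: a "bitstream" is modeled as a Python callable.
        self.kernel = bitstream
        self.done.clear()

    def start(self, data):
        def _run():
            self.result = self.kernel(data)
            self.done.set()  # signal completion to the control core

        threading.Thread(target=_run).start()


def control_core(engines, bitstreams, data):
    """Role of the control processor: configure, launch, synchronize."""
    for engine, bitstream in zip(engines, bitstreams):
        engine.load(bitstream)
        engine.start(data)
    for engine in engines:
        engine.done.wait()  # explicit synchronization point
    return {engine.name: engine.result for engine in engines}
```

In this sketch the kernels (`sum`, `max`, or any other callable) know nothing about each other; all synchronization is written explicitly in the control code, which mirrors the wrapping effort discussed in the text.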

Although this style of programming is acceptable for a relatively small number of processors, manual partitioning and synchronization of applications become impractical when a large number of processors work concurrently. Programming models for parallel systems (MPI, OpenMP, CUDA, OpenCL) are usually based on the C or C++ language, extended with application programming interfaces (APIs) or pre-processor directives that support synchronization, explicit description of parallelism, memory space allocation and vectorized data transfers. In particular, the OpenCL programming model used for the Manyac platform has the advantage of supporting a heterogeneous set of devices, characterized either by high data-level parallelism (for which a data-parallel programming model is provided) or by task-level parallelism.
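The distinction between data-level and task-level parallelism can be illustrated with a short sketch. It uses a Python thread pool as a loose analogy only; it is not OpenCL code, and the kernel names `scale`, `checksum` and `peak` are assumptions introduced for the example.

```python
from concurrent.futures import ThreadPoolExecutor


def scale(sample, gain=2):
    """Kernel applied independently to every element (data parallelism)."""
    return sample * gain


def checksum(block):
    """First of two unrelated tasks (task parallelism)."""
    return sum(block) % 256


def peak(block):
    """Second unrelated task."""
    return max(block)


def run_data_parallel(samples, workers=4):
    # Same kernel, many data items; map() returns results in input order.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(scale, samples))


def run_task_parallel(block):
    # Different kernels run concurrently; result() is the synchronization point.
    with ThreadPoolExecutor(max_workers=2) as pool:
        f_sum = pool.submit(checksum, block)
        f_peak = pool.submit(peak, block)
        return f_sum.result(), f_peak.result()
```

In the data-parallel case the programmer describes one kernel and the runtime distributes the data; in the task-parallel case the programmer describes distinct tasks and the points at which their results are needed.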

The second step in developing applications on multi-core (re)configurable processors is the partitioning of the computational workload between software and configurable hardware. Although many high-level synthesis tools have been proposed, in practice the most common methodologies for hardware design on both silicon and FPGAs are still based on register transfer level (RTL) descriptions written by hand in, for example, the VHSIC Hardware Description Language (VHDL). In the proposed platforms, this task corresponds to programming the configurable engines of Morpheus with the engines' proprietary tools, and to designing hardware accelerators with the Griffy flow of the Manyac platform.

The proposed evaluation should therefore estimate both the effort of managing thread/task-level parallelism, synchronization and wrapping of an application, and the effort of implementing the application-specific accelerators on run-time configurable fabrics or application-specific circuits.

The evaluation of programming productivity is an important problem that has been studied over the last 30 years. Although software productivity estimation and evaluation methodologies are still under investigation, some of them have already been ported to the field of hardware description languages [76]. As the applications described in this work are implemented on a heterogeneous set of software-programmable processors and run/design-time configurable components, a heterogeneous set of languages was analyzed. In order to evaluate the development productivity of applications on the different platforms, a well-known technique that provides ready-to-use data for many programming languages was used: Function Point Analysis (FPA) [77].

Table 4.1: Function point analysis parameters.

Table 4.1 summarizes the parameters required for the analysis, for the languages used to program the evaluated platforms. To evaluate the programming productivity of the proposed platforms, we approximated the wrapping/synchronization stage of the application as being implemented in C, even though it is actually realized with extensions of the C language (i.e., OpenCL). Thus, according to the model, the advantage of using these languages lies in the smaller number of statements needed to achieve the same functionality.

On the other hand, the accelerators were implemented, and the reconfigurable engines programmed, in VHDL, Griffy-C or NML, depending on the application. Griffy [65] and NML [64] show similarities with the intermediate representations (IR) used by most compilers to produce assembly code. In fact, modern compilers use high-level intermediate representations, often based on Static Single Assignment (SSA) form, to implement more efficient optimization steps [78][79][80]. For this reason, we consider ASM as the reference for the FPA of NML and Griffy. To perform the FPA on the selected test cases, we inspected the source code of the applications implemented on the described platforms, of the pure software implementations and of the FPGA implementations. Then, we used the parameters reported in Table 4.1 to estimate the design effort, starting from the number of statements extracted from the application source code. Table 4.2 reports the resulting data, expressed in person-days.
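The estimation procedure described above (statement count, then function points, then effort) can be sketched as follows. This is a minimal illustration of a statement-based function point estimate, not the exact model of this work: the field names `statements_per_fp` and `fp_per_person_day` are assumptions, and the numeric values are placeholders that do not reproduce Table 4.1.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LanguageParams:
    statements_per_fp: float   # source statements needed per function point
    fp_per_person_day: float   # productivity, function points per person-day


# Placeholder values for illustration only; NOT the entries of Table 4.1.
PARAMS = {
    "C": LanguageParams(statements_per_fp=128.0, fp_per_person_day=0.5),
    "ASM": LanguageParams(statements_per_fp=320.0, fp_per_person_day=0.5),
}


def function_points(statements, lang):
    """Convert a statement count into function points for one language."""
    return statements / PARAMS[lang].statements_per_fp


def effort_person_days(statements, lang):
    """Estimated design effort for one source component, in person-days."""
    return function_points(statements, lang) / PARAMS[lang].fp_per_person_day


def total_effort(parts):
    """Sum the effort of (statements, language) pairs, e.g. wrapper + kernel."""
    return sum(effort_person_days(s, lang) for s, lang in parts)
```

For example, `total_effort([(256, "C"), (640, "ASM")])` would model an application made of a C wrapper and a kernel whose language is assessed with the ASM parameters, in the spirit of the approximation adopted for NML and Griffy.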

According to the estimates, implementing the kernels in the proprietary languages of the reconfigurable engines accounts for most of the design effort, while the control and synchronization tasks implemented on the ARM processor require only a minor effort. In addition, the VHDL implementations on the eFPGA core require a considerably longer development time than the other applications. In Figure 4.1, the design efforts of applications implemented in the different languages are compared with the C-language ARM implementations of the same algorithms and with the VHDL implementations.


The results show that the FPGA implementations of the proposed applications require a design effort 42% to 76% larger than the corresponding Morpheus implementations. On the other hand, as expected, the C implementations require smaller efforts. Nevertheless, it should be noted that the manual optimizations typical of signal processing algorithm implementations on embedded processors and DSPs (e.g., assembly coding of critical kernels) were not performed in this context. Although the absolute number of person-days appears to be underestimated, the results of the analysis are qualitatively in line with our practical experience, and give a good view of the development time ratios among the different implementations.
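The relative figures quoted above are percentage increases with respect to a baseline implementation. A minimal sketch of that computation follows; the function name `relative_overhead` is an assumption, and the numbers in the usage note are made up for illustration rather than taken from Table 4.2.

```python
def relative_overhead(effort, baseline):
    """Percentage by which `effort` exceeds `baseline` (same unit for both)."""
    if baseline <= 0:
        raise ValueError("baseline effort must be positive")
    return 100.0 * (effort - baseline) / baseline
```

With hypothetical efforts of 142 and 100 person-days, `relative_overhead(142, 100)` gives 42.0, that is, an effort 42% larger than the baseline.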

Figure 4.1: Estimation of design effort required to implement selected applications on different computational platforms.

