Dynamic Profiling and Load-balancing of N-body Computational Kernels on Heterogeneous Architectures



A THESIS SUBMITTED TO THE UNIVERSITY OF MANCHESTER FOR THE DEGREE OF MASTER OF SCIENCE IN

THE FACULTY OF ENGINEERING AND PHYSICAL SCIENCES

2010

By Andrew Attwood

School of Computer Science


List of Tables

Table 2.1: LIBSPE2 functions

Table 2.2: SPE DMA functions

Table 2.3: Thread and fork creation overhead

Table 3.1: Offset pattern for 100 elements

Table 4.1: Parallel languages

Table 5.1: Gprof output non-threaded 10 particles 10 steps

Table 5.2: Gprof output non-threaded 60 particles 10 steps

Table 5.3: Nmod speed-up values

Table 7.1: Single and dual thread performance

Table 7.2: SPE timing data

Table 7.3: CELL Automated profiling algorithm timing


List of Figures

Figure 2.1: CBE block diagram

Figure 2.2: Division of computation CBE

Figure 2.3: Stream Processing CBE

Figure 2.4: CBE compile commands

Figure 2.5: Bus capacity

Figure 2.6: GPU CPU footprint comparison

Figure 2.7: Nvidia processor Architecture

Figure 2.8: Two dual threaded SMP processors

Figure 2.9: Two SMP single thread processors

Figure 3.1: Computation level given index

Figure 3.2: 5-Body example

Figure 3.3: Body index calculation ratio

Figure 3.4: Block cyclic schedule

Figure 3.5: Pseudo code for in process offset calculation

Figure 3.6: Cyclic Schedule

Figure 3.7: Equal area schedule

Figure 3.8: DAG approach

Figure 4.1: Waterfall model

Figure 4.2: Message passing pattern

Figure 4.3: The divide and concur pattern

Figure 4.4: CELL development process

Figure 4.5: Structure of array and arrays of structures

Figure 4.6: Implementation plan

Figure 5.1: N-body test pattern loaded into matlab

Figure 5.2: Application code to generate random body distribution and mass

Figure 5.3: JSP Diagram of the NMOD Application

Figure 5.4: Original Function declaration for findgravitationalinteractions

Figure 5.5: Thread communication structure

Figure 5.6: Pseudo code for thread creation and balancing on SMP

Figure 5.7: Nmod SMP cache coherent program design

Figure 5.8: CELL SPE program design

Figure 5.9: Posix memalign function prototype

Figure 5.10: Nvidia GPU design

Figure 5.11: Load-balancing and profiling diagram

Figure 6.1: Cell Compilation commands

Figure 6.2: Compile command for threaded SMP

Figure 6.3: SMP multi-threaded implementation with load balance

Figure 6.4: Calculate gravitation interactions function SMP

Figure 6.5: Force accumulator for each SPE aligned to 16 byte borders

Figure 6.6: Trans data structure initialisation

Figure 6.7: Loading SPE image file

Figure 6.8: Passing arg structure to SPE

Figure 6.9: SPE context creation

Figure 6.10: SPE DMA transfer operation

Figure 6.11: CUDA call find gravitational interactions

Figure 6.12: CUDA find gravitational interactions

Figure 6.13: Load balance and profile GPU/NVIDIA


Figure 6.14: Load-balancing implementation

Figure 6.15: Finalising test

Figure 6.16: Assembly commands for hardware timers

Figure 7.1: CBE 2 PPE thread Speed-up

Figure 7.2: 6 SPE Speed-up


List of Equations

Equation 5.1: Amdahl’s Law

Equation 5.2: NMOD Speed-up value


Abstract

Increasingly, devices are moving away from a single-architecture, single-core design. From the fastest supercomputer to the smallest mobile phone, devices are now being constructed with heterogeneous architectures. This heterogeneity is owing, in part, to the slowing of speed increases on single-chip, single-core devices, but equally to the realisation that coupling specific devices to cope with specific problems provides increased performance and power efficiency. Application development processes for multicore heterogeneous technologies are still in their infancy. Developing high-performance applications for the scientific community places the responsibility on the developer to maximise the use of the underlying architecture. Traditional approaches to the development process are inadequate to deal with the complexity of instantiating computation on heterogeneous architectures. Current load-balancing algorithms fail to provide the dynamism required to best fit computation to the available resource. This project seeks to design an algorithm that optimises the application of a computational problem to the available heterogeneous resource through runtime profiling and load-balancing. CELL and Nehalem/GPU heterogeneous architectures are targeted with an N-Body simulator combined with an implementation of a profiling and load-balancing algorithm. This implementation is subsequently tested under different computational load conditions.


Declaration

No portion of the work referred to in this report has been submitted in support of an application for another degree or qualification of this or any other university or other institute of learning.


Copyright

i. The author of this dissertation (including any appendices and/or schedules to this dissertation) owns any copyright in it (the ‘Copyright’) and s/he has given The University of Manchester the right to use such copyright for any administrative, promotional, educational and/or teaching purposes.

ii. Copies of this dissertation, either in full or in extracts, may be made only in accordance with the regulations of the John Rylands University Library of Manchester. Details of these regulations may be obtained from the Librarian. This page must form part of any such copies made.

iii. The ownership of any patents, designs, trademarks and any and all other intellectual property rights except for the Copyright (the ‘Intellectual Property Rights’) and any reproductions of copyright works, for example graphs and tables (‘Reproductions’), which may be described in this dissertation, may not be owned by the author and may be owned by third parties. Such Intellectual Property Rights and Reproductions cannot and must not be made available for use without the prior written permission of the owner(s) of the relevant Intellectual Property Rights and/or Reproductions.

iv. Further information on the conditions under which disclosure, publication and exploitation of this dissertation, the Copyright and any Intellectual Property Rights and/or reproductions described in it may take place is available from the Head of Department of Computer Science.


Acknowledgements

I would like to take this opportunity to thank my supervisor, Dr John Brooke, for his support and guidance during the past 16 months. My thanks also go to Dr Carey Pridgeon for developing the nmod application, which provides the computational base for my algorithm, and for his kind response to my questions. I would also like to thank Dr Jonathan Follows for providing the CELL and Nvidia CUDA training at STFC Daresbury. Finally, I would like to thank my family for their help and support over the past two years.


List of Abbreviations

CBE Cell broadband engine

SPE Synergistic processing element

PPU Power processing unit

SIMD Single instruction multiple data

SPU Synergistic processing unit

MFC Memory flow controller

LS Local store

DMA Direct memory access

EIB Element interconnect bus

SPUFS Synergistic processing unit file system

SIMT Single instruction multiple thread

ILP Instruction level parallelism

SMT Simultaneous multi-threading

JSP Jackson structured programming

HPC High performance computing

GCC GNU C compiler

AGP Accelerated graphics port

GPU Graphics processing unit

PCIe Peripheral component interconnect express

CPU Central processing unit


Chapter 1

Introduction

Heterogeneous computing is the coupling of dissimilar processor architectures, either within a single processor or as separate processors interconnected by an in-system bus. It is viewed as the next evolutionary step in the development of the CPU. Over the past five years, we have witnessed great advances in the availability of processing cycles on devices other than the central processing unit. At the same time, our reliance on a single fast processor core has been challenged, as the speed increases of these devices have slowed [1]. To combat this issue, most manufacturers have developed multi-core variants of their standard processors. Over the past ten years, graphics card manufacturers have been increasing the capability of their cards, mainly to keep pace with the requirements of users, who now demand ever more realistic game graphics and physics. Recently, libraries have been released which enable users to make use of the GPU as a massively powerful compute resource. At the time of writing, it is common to see quad-core chips in standard desktop machines, combined with a powerful compute-capable GPU. We also see heterogeneous chips, such as the Cell Broadband Engine, used in the world’s fastest computer [2]. This chip, in contrast to the CPU-GPU relationship, combines heterogeneous components into a single processor, as opposed to discrete elements connected by an external system bus.


Developing applications for single threaded execution can be difficult enough, but when also facing the challenge of multiple cores and heterogeneous compute elements, the challenge increases. Multi-threading libraries support the instantiation of virtual threads, and the underlying operating system will then accordingly schedule these virtual threads onto the available homogeneous hardware. Operating systems are unable to schedule threads across heterogeneous compute components, which consequently leaves the control of thread execution on these elements to the application which wants to make use of them. Application developers need to make decisions concerning how best to apply the available computation to the hardware elements in the target system.

Load-balancing concerns the application of computational tasks to the available elements in the machine. Failing to properly and fairly distribute load over the available elements will result in less than optimal runtime performance [3]. In a high-performance compute environment, it is essential that idle time is reduced as much as possible so as to maximise the return of investment made in the data centre infrastructure. During the lifetime of a computational task, the quantity of work assigned to an individual subsystem can change; this can lead to an imbalance which can emerge at any point during the lifetime of the computation.

Design time profiling is a common step in the development process. As an application is being developed, profiling is conducted in order to determine the sections of code that would benefit from some additional fine-tuning. These sections usually consume a disproportionate amount of the whole application runtime. Usually, it is these sections that we strive to split between the available resources. The complexity arises when these tasks are to be split amongst the available resources. It is only at runtime that we can determine the existence and capability of heterogeneous components and the cost of computation instantiation. This leads to the requirement to profile the application at runtime.


Planning for the development or redeployment of application code to a heterogeneous high-performance environment is a challenging activity. Conventional multi-threaded development is supported by a number of methodologies and design techniques, concerned with aiding in the development/transformation process. The ability to structure the development activity and to accordingly support the development through a design procedure is an important aspect of this project.


1.1

Motivation and Context of the Study

This project is concerned with the development of a runtime-profiling and load-balancing algorithm to support the computational deployment to the available computing resource. With the ever-increasing complexity and heterogeneous mix of components in both desktops and servers, the ability to correctly assign tasks to resources will be critical when striving to realise the full potential of these machines. However, it should be noted that this project is concerned with floating point applications as opposed to integer applications.

High-performance applications are typically those that spend a disproportionate amount of their runtime executing a small number of instructions over a large data set. This data set, especially in physical simulations, is iterated over many times as the simulation evolves. Integer applications typically use a large number of instructions and smaller data sets; office applications typically fall under this type.

Successful algorithm development will require different heterogeneous components to target. As outlined in the previous section, GPU components are becoming increasingly common. We believe that a heterogeneous system comprising a GPU and a host CPU is an essential architecture mix to include in this project. These devices are coupled using a PCI Express system bus.

The second architecture type should not be constrained by the connectivity of a system bus; architectures of this type are referred to as single-chip heterogeneous platforms. There are few affordable options available for targeting a single-chip heterogeneous platform. At the time of writing, the only realistic option is to make use of the Cell Broadband Engine as found in the PlayStation 3 games console. It may seem unusual to use a games console in the development of a complex, high-performance application; nevertheless, as shown by Buttari et al., the PlayStation 3 is more than capable of supporting scientific research [4].


1.2

Project Aims and Methodology

The aim of this project is to design an algorithm capable of determining the best pattern of instantiation on the available computational resource for a given computational problem. Subsequently, the algorithm will be implemented using different heterogeneous platforms to validate the effectiveness of the approach; this will be achieved through the transformation of an existing HPC application using a methodology for targeting heterogeneous architectures.

The objectives of this project are summarised as follows:

• To understand the development process and profiling requirements of high-performance applications on heterogeneous architectures; and

• To obtain speed-up on each target architecture for the N-Body simulation.

It is important that we have a deep understanding of the development process of heterogeneous applications, and that we can profile applications in real time in order to better match computation to the available resource. To prove that the suggested approach for runtime profiling is effective, we need to show speed-up over the single threaded versions of the application.

Accordingly, in order to achieve the project aims, the following key points will need to be achieved:

• Understanding of the target heterogeneous architectures;

• Understanding of the development process for heterogeneous systems;

• Identification of the methods which enable runtime-profiling and load-balancing; and

• Researching of the available tools and technologies to enable the implementation.


To realise the development of a heterogeneous application, a thorough literature review of heterogeneous development will be undertaken. Two target architectures will be used to validate the profiling algorithm: the first is the Cell Broadband Engine, developed by Sony, Toshiba and IBM; the second is the Nvidia 9800 GTX graphics card with an Intel Nehalem host processor. These devices have a total of four discrete computational units, each with its own toolset and architectural nuances which will need to be explored before the implementation phase. In order to implement a profiling algorithm and to profile the target application, the author will review the current literature and example case studies of multi-threaded and heterogeneous development; this will continue with a further general review of application development methodologies for both multi-threaded and heterogeneous development. In the same section, we will suggest an approach for development that will be used in the design and implementation. Following on from the literature review, we will detail the algorithm design, followed by details of the implementation.

To test the load-balancing and profiling algorithm, a target application is required. Developing such an application from scratch, however, is outside the scope and time constraints of this project. A number of applications were therefore considered after referring to the well-known list of computational dwarfs detailed in [5]. As a result, we decided to implement an N-Body simulator, as it has sufficient complexity and can be easily adjusted to vary the computational problem size. A suitable open source project was found, titled ‘N-Mod’. Hosted by Google Code and developed by Dr Carey Pridgeon, the application is provided as C source code. For the rest of the project, N-Mod will be referred to as the ‘target application’.

Once the target application has been modified to run on the target architecture, the results of the speed-up will be documented without the use of the balancing algorithm. The load-balancing algorithm will then be applied, and a comparison as to its effectiveness will be conducted.


1.3

Project Phases

Below is a summary of the main project phases:

1. Literature and background research

2. Algorithm design

3. Prototype algorithm implementation:

a. Target platform installation and configuration

b. Development of N-Body algorithm

c. Implementation of flexible load-balancing system

d. Implementation of load-balancing algorithm

4. Test/Timing


1.4

Structure for the Dissertation

This dissertation contains eight chapters. The first chapter provides the introduction, and details the project’s background and the significance of the problem area, as well as the motivation for the research. Chapter One also details the aims, and a methodology to realise the aforementioned aims, and further details the main phases that will be conducted. Chapter Two investigates the architectural and development environments for the target heterogeneous architecture, as well as projects which have been successfully developed on these systems. Chapter Three will research the area of computational load-balancing—a component which is fundamental to the development of an algorithm that scales independent of architectural constraints. In Chapter Four, current profiling techniques will be reviewed, including both static and runtime techniques. Chapter Five will review development methodologies and design formalisation techniques for both single and multi-threaded development. Additionally, Chapter Five will bring the research together in such a way so as to propose a design for heterogeneous profiling and a load-balancing algorithm. In Chapter Six, the implementation of the algorithm using the target application will be detailed, highlighting how various challenges were overcome. Chapter Seven will show detailed testing of the converted algorithm for the target architectures, and will also include a review of the speed-up obtained using various core and load combinations. The load-balancing algorithm will also be tested so as to demonstrate that it can determine the best architecture for a given computational load. Chapter Eight provides a conclusion of the project, and recommends future work concerning the implementation and design of the load-balancing algorithm, as well as the methodology followed in its development.


Chapter 2

Heterogeneous Architectures

2.1

Chapter Overview

Computer systems have traditionally been deployed using single core homogeneous architectures. In order to meet growing user requirements, we have witnessed the development of systems which combine differing architecture types. However, the integration of different architectures brings with it its own problems. For example, each architecture can differ drastically from the standard x86 design, and requires specific compiler support [6]. One main difference is the way in which these architectures interact with the main memory. This chapter will survey the literature relating to the target architectures that will be used in the application of the profiling algorithm.

2.2

Cell Broadband Engine

The Cell Broadband Engine is a single-chip heterogeneous system developed by a partnership of Sony, Toshiba and IBM [7]. It has a revolutionary design consisting of a single PPU (modified Power PC 4 core) which controls 8 SPE (6 available in the Sony PlayStation). The PPU is a 64-bit, dual-threaded simultaneous multi-threading processor which is similar to the PowerPC 970, and which has 32 KB of level one cache, split between separate instruction and data caches, and 512 KB of level two cache. The PPU has a SIMD engine based on the Altivec instruction set [8]. Although powerful, the PPU is included as the controller of the SPE and is not seen as being a target for computational load. Usually, the PPE would run the operating system and control the operations of the HPC application. Notably, the PPE is more capable of task-switching than the individual SPE, owing to the SPE’s lack of branch prediction logic [9].

Each SPE consists of an SPU, a 256 KB LS and an MFC. The SPU is a 128-bit vector engine which uses a different instruction set to that of the PPE. Moreover, the SPU has a bank of 128 registers, each 128 bits wide, which is a large number of registers in comparison to a standard x86 processor; however, there is no on-chip cache. Furthermore, the SPU is not coherent with the main system memory, and can only access its own LS and local MFC. The MFC is a messaging and DMA controller, which communicates with the other MFCs, the PPE and the main memory controller using the EIB. The EIB consists of four unidirectional buses, which interconnect the PPE, SPE, IO and memory controller. To meet the communication requirements, the EIB has a peak bandwidth of 204.8 GB/s. Figure 2.1 shows the interconnections of all the major components of the CBE.


The PPE is capable of running application code which is compatible with the PowerPC architecture. Application code is executed from the main memory, being transparently transferred to the L1 and L2 caches using a cache coherency protocol. SPU application code is located in the main memory, and a pointer to the application code is provided to the SPU by the PPU. The SPU can then direct the MFC to transfer the application code into the LS, and the SPU can subsequently execute the code directly from the LS. Once the SPU has finished with any variables in its own LS, it can then direct the MFC to transfer data back into the main memory. Messages can then be sent between the SPU and PPU to indicate the completion of processing [10]. Figure 2.2 details the division of two computational tasks between the PPE and SPE. The PPU performs various initialisations before executing the task on the individual SPE; each SPE then computes its assigned sub-section of the task before informing the PPE; the PPE collects and combines the results before preparing and starting the next stage of the model evolution; and the PPE then either continues evolving the model, directing the SPE through further compute stages, or terminates [9].
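The assignment of a sub-section of the task to each SPE can be illustrated with a short C sketch. This is not the thesis implementation: the helper name partition_range and the choice of equal contiguous blocks are assumptions made purely for illustration, and the sketch runs on any host rather than on the CBE.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: split n_items into n_workers contiguous blocks.
 * Worker `id` receives the half-open range [*start, *start + *count).
 * The first (n_items % n_workers) workers take one extra item, so block
 * sizes differ by at most one. */
void partition_range(size_t n_items, size_t n_workers, size_t id,
                     size_t *start, size_t *count)
{
    assert(n_workers > 0 && id < n_workers);
    size_t base  = n_items / n_workers;   /* minimum items per worker   */
    size_t extra = n_items % n_workers;   /* workers that get one more  */
    *count = base + (id < extra ? 1 : 0);
    *start = id * base + (id < extra ? id : extra);
}
```

With 100 items and 6 workers, for example, the first four workers would each receive 17 items and the last two 16. An equal split of indices is only the simplest possible choice and does not by itself guarantee equal computational load.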


SPE and PPE can communicate using EIB by sending messages to each other’s mailboxes. This messaging capability provides alternative kernel instantiation patterns. One common method is to have each SPE conduct a unique stage of processing, as shown in Figure 2.3. This essentially creates a computational chain commonly found in video processing.

Figure 2.3: Stream Processing CBE

There are additional benefits to using a chain of SPEs when the application code is too big to fit into a single SPE; we can reduce the swapping of application space to and from the LS in the SPE by dividing the workload between the individual SPE.
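The chained arrangement can be sketched in portable C as a list of stage functions applied in order to a block of data. The stage names and the run_pipeline helper are hypothetical and are not taken from the thesis. On the CBE each stage would reside on its own SPE and the stages would work on different blocks concurrently; this host-side sketch runs them one after another and only shows the structure.

```c
#include <assert.h>
#include <stddef.h>

/* A stage transforms a block of floats in place. */
typedef void (*stage_fn)(float *block, size_t n);

/* Two hypothetical stages standing in for per-SPE kernels. */
void stage_scale(float *block, size_t n)
{
    for (size_t i = 0; i < n; i++)
        block[i] *= 2.0f;
}

void stage_offset(float *block, size_t n)
{
    for (size_t i = 0; i < n; i++)
        block[i] += 1.0f;
}

/* Pass one block through every stage in order, as a chain of SPEs would. */
void run_pipeline(const stage_fn *stages, size_t n_stages,
                  float *block, size_t n)
{
    assert(stages != NULL && block != NULL);
    for (size_t s = 0; s < n_stages; s++)
        stages[s](block, n);
}
```

Because each stage only needs its own code in memory, the same structure lets a program that is too large for one LS be split across several.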

2.2.1

Development of the CBE

Only the Linux kernel can communicate with the individual SPE, and so drivers have been developed in order to provide easy access to the developer.


The following functions are provided by the underlying driver [11]:

• Loading a program binary into an SPU;

• Transferring memory between an SPU program and a Linux user space application; and

• Synchronising execution.

The application running on the PPE makes use of LIBSPE2 (SPE Runtime Management Library version 2) to control execution on the SPE. LIBSPE2 in turn utilises the SPUFS, which is provided by the driver in kernel space. The PPE application uses SPE contexts provided by LIBSPE2 to load the SPE application image, deploy the image to the SPE, trigger execution, and finally destroy the SPE context [12]. The SPE contexts are key to controlling the execution of the binary image on the CBE. Each context needs to be run in an individual PPE thread which is spawned from the main PPE application.

Due to the heterogeneous nature of the devices, the application machine code for the PPE and for the SPE needs to be compiled separately. Importantly, Sony provides a GCC compiler which will create a binary executable for each architecture.

$ gcc -lspe2 helloworld_ppe.c -o helloworld_ppe.elf
$ spu-gcc helloworld_spe.c -o helloworld_spe.elf

Figure 2.4: CBE compile commands

In Figure 2.4 you can see that GCC compiles the PPE elf executable, whereas the SPE elf file is compiled using the spu-gcc command. In order to execute the application, the PPE elf file is launched, which in turn loads the SPE elf into memory and accordingly directs the individual SPE to execute through that SPE’s context.

Each SPE requires its own context, held in a variable of type spe_context_ptr_t. This structure is used to store the information required to communicate with the SPE. The functions listed in Table 2.1 are used to load the SPE elf image from disk and, using the context structure, to load the image onto the SPE, to run it, and finally to destroy the context and clear the SPE image from memory.

Name                     Functionality
spe_image_open()         Load the image from the disk into main memory and return a pointer to the image.
spe_context_create()     Create the context, populating the supplied spe_context_ptr_t structure.
spe_program_load()       Load the supplied image pointer to the supplied SPE defined by its context.
spe_context_run()        Run the current image on the context provided.
spe_context_destroy()    Destroy the SPE context.
spe_image_close()        Remove the image from main memory.

Table 2.1: LIBSPE2 Functions

The application image, which is loaded onto the SPE, will have to be supplied with the data required for processing. This data will be located in the main memory or on a disk drive. It is the responsibility of each individual SPE to DMA its section of the data that it wishes to process from the main memory to its LS. DMA operations are provided to the developer using intrinsic functions defined in the spu_intrinsics.h and spu_mfcio.h header files. Intrinsic functions wrap a number of machine language commands that are not available to the GCC compiler.

Name              Functionality
spu_mfcdma64()    This function instructs the DMA controller in the SPE to transfer the specified number of bytes to or from Main Memory/LS. A tag is specified so a number of DMA transfers can be conducted at once.
spu_writech()     This macro is used to communicate between SPU and MFC.
spu_mfcstat()     We can call this function to stall the application until the transfer is complete. We can use a technique referred to as double buffering so that the application can process the information it has instead of waiting for all data to be transferred.

Table 2.2: SPE DMA functions
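The double buffering technique mentioned in Table 2.2 can be sketched in portable C. This is a minimal illustration of the control flow only, under stated assumptions: fetch_chunk is a hypothetical stand-in that uses memcpy where an SPE would issue an asynchronous DMA transfer, and CHUNK is an arbitrary illustrative size. On real hardware the benefit comes from the transfer into one buffer overlapping with computation on the other.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define CHUNK 4  /* elements per buffer; illustrative only */

/* Stand-in for a DMA "get": on an SPE this would be an asynchronous
 * transfer from main memory into the local store. Here it is a memcpy. */
void fetch_chunk(float *local, const float *main_mem, size_t n)
{
    memcpy(local, main_mem, n * sizeof *local);
}

/* Sum `total` floats using two local buffers: before the current buffer
 * is processed, the next chunk is already requested into the other one. */
float sum_double_buffered(const float *main_mem, size_t total)
{
    float buf[2][CHUNK];
    size_t cur = 0, done = 0;
    float sum = 0.0f;

    assert(main_mem != NULL);
    size_t n_cur = total < CHUNK ? total : CHUNK;
    fetch_chunk(buf[cur], main_mem, n_cur);        /* prime first buffer */

    while (n_cur > 0) {
        size_t next_off = done + n_cur;
        size_t n_next = total - next_off;
        if (n_next > CHUNK)
            n_next = CHUNK;
        if (n_next > 0)                            /* request next chunk */
            fetch_chunk(buf[cur ^ 1], main_mem + next_off, n_next);

        for (size_t i = 0; i < n_cur; i++)         /* process current    */
            sum += buf[cur][i];

        done = next_off;
        n_cur = n_next;
        cur ^= 1;                                  /* swap buffers       */
    }
    return sum;
}
```

On an SPE, a wait on the tag of the outstanding transfer would be placed just before the buffers are swapped, so that the stall only occurs if the transfer has not finished by the time the current chunk has been processed.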

Using the functions in Table 2.2 and a pointer to a data structure containing the required addresses, the SPE can instruct the DMA controller to access the main memory and make the data required for processing available to the SPU [12]. Each SPE has its own LS and communicates with the system memory using DMA transfers. The DMA controller is connected to the other SPE, the PPE and the main memory using the EIB. Each SPE operates in a separate address space to the PPE and the main memory, which is not coherent or uniform with the other SPE or the PPE. One implication of using DMA is that items to be transferred to or from the SPE have to be aligned to a 16-byte boundary, with a maximum of 16 KB per transfer and a limit of 32 simultaneous transfers [12].
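The alignment and size limits stated above translate into simple arithmetic, sketched here in portable C. The helper names and macros are hypothetical and are not part of any Cell SDK header; only the 16-byte and 16 KB figures come from the text.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DMA_ALIGN     16u            /* byte alignment stated in the text  */
#define DMA_MAX_BYTES (16u * 1024u)  /* maximum bytes per transfer (16 KB) */

/* Non-zero if ptr satisfies the 16-byte alignment requirement. */
int dma_is_aligned(const void *ptr)
{
    return ((uintptr_t)ptr % DMA_ALIGN) == 0;
}

/* Round a byte count up to the next multiple of 16. */
size_t dma_padded_size(size_t bytes)
{
    assert(bytes <= SIZE_MAX - DMA_ALIGN);  /* guard against overflow */
    return (bytes + DMA_ALIGN - 1) / DMA_ALIGN * DMA_ALIGN;
}

/* Number of transfers needed to move `bytes`, at most 16 KB at a time. */
size_t dma_transfers_needed(size_t bytes)
{
    return (bytes + DMA_MAX_BYTES - 1) / DMA_MAX_BYTES;
}
```

A buffer that satisfies the alignment requirement can be obtained on the PPE side with an aligned allocator such as posix_memalign(); a structure larger than 16 KB has to be moved as several transfers, which is what dma_transfers_needed() counts.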

Application code on an individual SPE is executed by calling the function spe_context_run(). This call blocks the calling application until the SPE has completed processing. When running 8 SPE at the same time, we need to call spe_context_run() for each SPE without any one call causing the PPE application to halt. Therefore, each call to spe_context_run() should be made in a separate thread [13].

Platform                                  fork()                  pthread_create()
                                          real    user    sys     real    user    sys
AMD 2.3 GHz Opteron (16cpus/node)         12.5    1.0     12.5    1.2     0.2     1.3
AMD 2.4 GHz Opteron (8cpus/node)          17.6    2.2     15.7    1.4     0.3     1.3
IBM 4.0 GHz POWER6 (8cpus/node)           9.5     0.6     8.8     1.6     0.1     0.4
IBM 1.9 GHz POWER5 p5-575 (8cpus/node)    64.2    30.7    27.6    1.7     0.6     1.1
IBM 1.5 GHz POWER4 (8cpus/node)           104.5   48.6    47.2    2.1     1.0     1.5
INTEL 2.4 GHz Xeon (2 cpus/node)          54.9    1.5     20.8    1.6     0.7     0.9
INTEL 1.4 GHz Itanium2 (4 cpus/node)      54.5    1.1     22.2    2.0     1.2     0.

Table 2.3: Thread and fork creation overhead

To call spe_context_run() in a separate thread, the POSIX threads (Pthreads) library can be used. Pthreads is a C language programming interface specified in the IEEE POSIX 1003.1c standard [14]. It is preferable to create a thread for each context rather than a process, because thread creation has lower overheads than process creation, as shown in Table 2.3. Threads are created using the pthread_create() function call, and a thread is collapsed back into its parent using the pthread_join() function [14].
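The one-thread-per-context pattern can be sketched with Pthreads alone. In this minimal example blocking_work() is a hypothetical stand-in for the blocking SPE run call, so that the sketch compiles on any POSIX host without LIBSPE2; the worker_arg structure and the worker count of six are likewise assumptions for illustration.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define N_WORKERS 6  /* e.g. the six SPEs available on the PlayStation 3 */

typedef struct {
    int id;      /* worker index                 */
    int result;  /* written by the worker thread */
} worker_arg;

/* Hypothetical stand-in for a blocking call that runs one SPE context. */
static int blocking_work(int id)
{
    return id * id;
}

/* Thread entry point: run the blocking call for one worker. */
static void *worker_main(void *p)
{
    worker_arg *arg = p;
    arg->result = blocking_work(arg->id);
    return NULL;
}

/* Start one thread per worker, then join them all.
 * Returns 0 on success, or the pthread error code on failure. */
int run_workers(worker_arg args[N_WORKERS])
{
    pthread_t tid[N_WORKERS];

    assert(args != NULL);
    for (int i = 0; i < N_WORKERS; i++) {
        args[i].id = i;
        int rc = pthread_create(&tid[i], NULL, worker_main, &args[i]);
        if (rc != 0) {
            for (int j = 0; j < i; j++)   /* clean up threads already started */
                pthread_join(tid[j], NULL);
            return rc;
        }
    }
    for (int i = 0; i < N_WORKERS; i++)
        pthread_join(tid[i], NULL);
    return 0;
}
```

The main thread stays free between the create and join loops, which is where a controlling application could perform its own work while the workers run.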

Access to SIMD instructions on the PPE is provided by the ALTIVEC headers. The inclusion of these headers gives access to intrinsic functions which provide easy access to SIMD instructions. The SPE uses a different set of instructions for SIMD operations, which leads to additional work when porting an application from the PPE to the SPE. Double precision performance on the first generation CELL processor is relatively poor; IBM have corrected the issue in the second generation processor.

2.3

Nvidia GPU with Intel Nehalem

GPU are not general-purpose processing devices and are unable to run integer type applications, e.g. operating systems; consequently, they are unable to operate in isolation. Typically, a GPU is combined with a general-purpose processor, such as an x86 processor or ARM RISC processor; these processor types are more capable of running integer type applications. We will refer to the general-purpose processor as the CPU from this point forward. Usually, the GPU is connected to the CPU using the fastest system bus available. In recent years, we have seen a move away from AGP to the faster PCI Express 2. Figure 2.5 shows the increase in data rate from AGP 4x to PCIe 2.


The CPU is capable of addressing the system’s main memory, but the processors on the GPU are only capable of communicating with their own dedicated on-board memory. The CPU communicates with the GPU over the PCI Express bus. Application code and data are transferred from system RAM over the PCI Express bus by the on-board DMA engine on the GPU.

Whilst the CPU has evolved into a generic compute device capable of running a range of different operating systems and applications, the GPU has targeted specific applications. Mainly, these include 3D graphics for the computer games market. However, it is noteworthy that the types of massively parallel calculations performed in rendering 3D graphics can be put to use in more general HPC applications. Traditionally, porting applications from CPU to GPU was not a trivial task [16]; nevertheless, recent advances by card manufacturers have led to the development of a number of frameworks which simplify GPU development. CUDA, OpenCL and DirectCompute are three frameworks which aim to reduce the development time and, in the case of OpenCL and DirectCompute, enable easy cross platform development [17]. It is expected that the increasing usability of GPU frameworks and libraries, combined with the increasing computational ability of GPUs, will increase their future use in HPC applications.

2.3.1

Nvidia Architecture

The GPU architecture that will be used in this project is based on the Nvidia G80 processor. The typical configuration of the G80 is shown in Figure 2.7, and consists of 16 SMs (streaming multiprocessors), each of which comprises 8 computational units, thereby giving a total of 128 processors. Each computational unit can execute 32 threads, giving 4,096 simultaneous threads. Computational units operate on single precision floating point values; however, each processor has a single double precision unit, which fundamentally limits the ability of the GPU to perform double precision calculations at the same speed as it executes single precision calculations. Each SM has its own 16KB L1 cache, and the quantity of on-board main memory varies with card cost. The processor is designed to perform the same set of instructions in parallel on different data elements [17]. As we can see in Figure 2.6, the GPU has a greater surface area dedicated to compute activities compared with the CPU, whereas the CPU has large sections dedicated to control and caching owing to the nature of the general application types it has to execute.

Figure 2.6: GPU CPU Footprint comparison [18]

The GPU can forgo the expense of complex branch prediction hardware since the quantity of branches and random data accesses is low. One important aspect of targeting the GPU is the bandwidth available between the GPU and main memory over PCI Express, which is 8 GB/s. Between the SMs and on-GPU RAM, bandwidth is as high as 141 GB/s. Maximising on-GPU data access would therefore give an application roughly 17 times the memory bandwidth (141 / 8 ≈ 17.6).
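The effect of the two bandwidth figures quoted above can be estimated with a short calculation. The sketch below is illustrative only: the function names are hypothetical, 1 GB is assumed to be 10^9 bytes, and both figures are treated as sustained rates, which real transfers rarely achieve.

```c
#include <assert.h>

#define PCIE_GB_PER_S   8.0    /* host memory to GPU, figure quoted in the text */
#define DEVICE_GB_PER_S 141.0  /* SM to on-board RAM, figure quoted in the text */

/* Idealised time in seconds to move `bytes` at `gb_per_s`,
   assuming 1 GB = 1e9 bytes and no latency or protocol overhead. */
double transfer_seconds(double bytes, double gb_per_s)
{
    assert(bytes >= 0.0 && gb_per_s > 0.0);
    return bytes / (gb_per_s * 1e9);
}

/* Ratio of on-board bandwidth to PCI Express bandwidth. */
double bandwidth_ratio(void)
{
    return DEVICE_GB_PER_S / PCIE_GB_PER_S;
}
```

Under these assumptions, data that is reused many times on the device amortises the cost of the single PCI Express transfer, which is why keeping the body data resident on the GPU between steps is attractive.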

2.3.2

Nvidia Development

Developing GPU applications requires the use of frameworks and compilers separate from those of the host architecture. Three main frameworks have been developed: OpenCL, DirectCompute and CUDA. OpenCL is being developed by the Khronos Group. Notably, it is envisaged as a framework and API abstraction for enabling applications to execute over a wide spectrum of heterogeneous architectures [19]. DirectCompute is part of Microsoft’s DirectX 11 release; as such, it is limited to the Windows platform. This project will utilise the CUDA platform from Nvidia.

CUDA was developed by Nvidia in 2006 as an extension to the C language [18]. Computational kernels are written as C functions preceded by the __global__ declaration specifier. These functions can then be called in a similar manner to a standard C function, with the exception that we define the number of thread blocks and the number of threads in a block. Threads in each block are able to communicate with each other through shared memory, and have the capability to synchronise. Each thread is executed by a single scalar processor, and threads are grouped into blocks that will be allocated to an individual multiprocessor [20]. A number of blocks form a grid, which is also referred to as a kernel. Only a single kernel can execute on a device at any one time [21].

Figure 2.7: Nvidia Processor Architecture

Each thread block is divided into groups of 32 threads, each referred to as a warp. The threads in a warp always execute the same instruction stream on different data values. Nvidia refers to this as ‘Single Instruction Multiple Thread’ [20].
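The block and warp arithmetic implied above can be sketched on the host side in plain C. The helper names are hypothetical, and the sketch assumes the common arrangement of one thread per data element; it is not taken from the project code.

```c
#include <assert.h>

#define WARP_SIZE 32  /* threads per warp, as stated in the text */

/* Number of thread blocks needed to give every element its own thread
   (rounds up so that a partial final block is still launched). */
long blocks_needed(long elements, int threads_per_block)
{
    assert(elements >= 0 && threads_per_block > 0);
    return (elements + threads_per_block - 1) / threads_per_block;
}

/* Number of warps a block is divided into; a block size that is not a
   multiple of 32 leaves the last warp partly empty. */
int warps_per_block(int threads_per_block)
{
    assert(threads_per_block > 0);
    return (threads_per_block + WARP_SIZE - 1) / WARP_SIZE;
}
```

For 1,000 bodies and 256 threads per block, this gives 4 blocks of 8 warps each, with the final block only partly occupied.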

2.3.3

Intel Nehalem Architecture

Intel’s Nehalem processor is the company’s first microarchitecture to have separate memory controllers for each processor within the same SMP system [22]. In previous SMP systems, all processors were connected to the main system memory using a single memory controller. As processing speeds increased, this created a bottleneck for memory access. In [22], the authors demonstrate that Nehalem can operate up to 4 times faster than the previous flagship processor microarchitecture, Harpertown. The test server that will be used to evaluate the completed algorithm has a single Core i7 processor running at 2.67 GHz with 4 hyper-threading cores. The Nehalem architecture features Turbo Boost and Hyper-threading technology.

Hyper-threading is the variation of simultaneous multithreading (SMT) found in Intel microarchitectures. Hyper-threading enables each processor core to run two threads. Operating systems will report two logical processors for every physical core. For our test system, this gives us 8 hardware threads; notably, however, this does not mean that 8 threads will be executing instructions on every clock cycle [23]. The Nehalem has a superscalar architecture, which results in multiple instructions being run in parallel; this is referred to as instruction-level parallelism (ILP). The instruction pipeline of each logical processor will stall when data is required from the cache or main memory. In this instance, a core lacking SMT features will sit idle, which will consequently result in wasted clock cycles. In Figure 2.9, we are able to see that a single-threaded processor contains a single architecture state, whereas a dual-threaded processor (Figure 2.8) has two architecture states.

Figure 2.8: Two dual-threaded SMP processors [23]

Each logical processor is created by the existence of an architectural state on the processor. This architectural state comprises both general control registers and advanced interrupt control registers. The goal of hyper-threading technology is to allow thread-switching when the running thread stalls. Importantly, the stall could occur for a number of reasons, including cache misses, branch mis-predictions, or pipeline result dependencies [23].

Figure 2.9: Two SMP single-thread processors [23]

Both logical processors have access to the same L1 cache, which is split into 32 KB for data and 32 KB for instructions. Each processor core also has its own 256 KB L2 cache. There is also an 8 MB level 3 cache, which is shared between all the processor cores. In order to maintain cache coherence, there is a snoopy bus which operates between the L2 and L3 caches. The processor has its own DDR3 memory controller and QuickPath interconnect. QuickPath is a fast interconnect, which is able to connect up to four processors together.

Nehalem also supports Intel’s Turbo Boost technology, which enables the individual cores to speed up or slow down dependent on requirements [24]. Temperature levels on the processor are monitored; if a single thread is running, then the other cores will be idle, which would result in a low processor temperature. If there is only one core active, the clock rate of that core can be increased by 266.66 MHz; if there are two or more cores active, the increase is applied in steps of 133.33 MHz.

2.3.4

Intel Nehalem Development

Development for the Nehalem is similar to development on the PPE of the Cell Broadband Engine. The Pthreads library is used to create threads from the main application process; these threads can perform computation. With a single Nehalem processor, eight threads can be active at once. The Nehalem processor is capable of SIMD operation, and exposes the SSE instructions through intrinsic functions.

2.4

Chapter Summary

This chapter has provided background research on the heterogeneous architectures that will be targeted by the load-balancing algorithm. Research has shown that the CBE has a cache coherent PPE, which will be easy to target using the Pthreads library. A similar approach can be adopted for the Intel Nehalem processor. Importantly, there is no reason why the same application code will not operate on both processors; that is, unless SIMD intrinsic functions are used. The Nehalem has a faster clock rate and has four hyper-threading cores, whereas the single PPE supports two threads. The PPE has less on-chip cache available, and has slightly older technology compared with that of the Intel, which has recently experienced an overhaul of its microarchitecture. The CBE has a high-speed interconnect to the SPE processors, although the development of software for the SPE architecture is more complex than targeting the cache coherent PPE and Nehalem. The Nehalem will use the GPU as its heterogeneous processing element. The GPU is connected to the processor using the PCI Express bus; this may add latency when transferring the model from the main memory to the on-board memory of the GPU. This may accordingly result in poor performance when limited processing is available, since the ratio of overhead to computation is then greater. A runtime profiler would allow the correct compute component to be targeted for a given computational task.

Chapter 3

Scheduling, Load Balancing and Profiling

3.1

Chapter Overview

Computational scheduling is the process of distributing a set of independent tasks to the underlying hardware with the aim of minimising the application runtime [25]. Load-balancing is a subset of this process, which is ultimately concerned with evenly distributing the load between the available devices. We are concerned with addressing load imbalance, as this will introduce a disparity in the execution time of the individual tasks and may therefore result in idle cycles on those devices allocated too few tasks. These idle cycles are wasted time which could otherwise be used to increase the fidelity of the model being computed, or to save resources by requiring less time to process the current model fidelity. Profiling is concerned with timing an application and, more importantly, its components with a view to optimisation. Profiling is usually conducted during the period of design; however, this project is concerned with adapting the scheduling of computation based on runtime profiling.

3.2

Computational Scheduling

There are two main types of scheduling algorithm: static and dynamic [26]. Static scheduling, also known as compile time scheduling, divides the work into fixed size chunks and distributes them between the available processors. Examples of such algorithms include block, cyclic and block cyclic. Dynamic scheduling assigns work at runtime. One example, self-scheduling, creates a queue which contains all task blocks. Blocks are then assigned to individual processors and, when a processor has completed its assigned block, it asks the scheduler for more work. Self-scheduling algorithms incur a communication overhead between the child and master processes. There is also a period of time where the process is idle whilst the request is being processed and then staged. A number of algorithms have been designed to reduce the communication overhead by varying the chunk size. One such example is guided self-scheduling [27].
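The way guided self-scheduling varies the chunk size can be sketched as follows. The sketch assumes the usual form of the rule, in which each request receives the remaining work divided by the number of processors, rounded up; the function names are hypothetical.

```c
#include <assert.h>

/* Guided self-scheduling rule (assumed form): each request is given
   ceil(remaining / procs) tasks, so chunks shrink as work runs out. */
long gss_chunk(long remaining, int procs)
{
    assert(remaining >= 0 && procs > 0);
    return (remaining + procs - 1) / procs;
}

/* Number of requests made to the scheduler before `total` tasks
   have all been handed out. */
int gss_requests(long total, int procs)
{
    int requests = 0;

    assert(total >= 0 && procs > 0);
    while (total > 0) {
        total -= gss_chunk(total, procs);
        requests++;
    }
    return requests;
}
```

Large early chunks keep the number of scheduler requests low, whereas the small final chunks smooth out the finishing times of the processors.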

As we have seen in Chapter Two, both the Nehalem, when in a dual processor configuration, and the CELL BE have NUMA characteristics. A single Nehalem exhibits cache coherent uniform memory access, as individual cores have the same access time to cache and RAM. In the dual processor configuration, each processor has a memory controller with direct access to its own RAM. The processors are connected to each other using Intel’s QuickPath Interconnect technology [28]. The L2 caches of both processors are coherent, but access times to the separate RAM controllers are non-uniform. Each SPE on the CELL BE is attached to the same system RAM, and so it is defined as a uniform memory access architecture. However, SPEs are non-cache coherent and ultimately rely on the programmer to maintain non-contended access to memory locations. Scheduling on NUMA architectures should take into account the latency incurred in accessing memory [26].

3.2.1

Assessment of the N-Body Problem

Many tasks, including the N-body problem, have a differing computational requirement based on the index being processed. When processing an N by N matrix, the computation is the same for every index. However, when dealing with the N-body problem, we see that calculating Index 1 will require greater processing than Index 2, and a far greater quantity of processing than Index N.

Figure 3.1: Computation level given index

In the case of the N-Body problem, this imbalance is caused by the nature of the force calculations for each body in the system.

Figure 3.2: 5-Body Example

In Figure 3.2, we can see that, during Step 1, body Index 1 needs to calculate the force acting between it and all other bodies in the system. Note that, in later steps, a body will not need to calculate the force acting between it and any previous body for which force calculations have already been performed. This is because the previous body’s calculation will already have updated the force it applies to the current body. The number of calculations required for each step is equal to the number of bodies minus the value of the index. Figure 3.3 shows this relationship.

Figure 3.3: Body index calculation ratio
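The relationship between body index and work can be written down directly. The sketch below uses hypothetical function names and 1-based indices, following the rule stated above that index i requires N minus i force calculations.

```c
#include <assert.h>

/* Force calculations needed for body `index` (1-based) among n bodies. */
long force_calcs_for_index(long n, long index)
{
    assert(n > 0 && index >= 1 && index <= n);
    return n - index;
}

/* Total pairwise force calculations in one time step: the sum of
   n - i over all indices, which equals n * (n - 1) / 2. */
long total_force_calcs(long n)
{
    long total = 0;

    for (long i = 1; i <= n; i++)
        total += force_calcs_for_index(n, i);
    return total;
}
```

For the 5-body example, Index 1 performs 4 calculations, Index 5 performs none, and the step requires 10 calculations in total.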

One of the more basic scheduling techniques statically assigns a proportion of the complete task to individual nodes. Two common schemes are block and cyclic[25].

3.2.2

Block and Cyclic Scheduling

Figure 3.4 shows the block cyclic schedule, which divides the index range by the number of available processors using a static, equal, in-order distribution. For the N-body problem, the first workload group, S1 in Figure 3.4, carries the most computation yet is allocated to a single processor, which results in a load imbalance. Processor 4, allocated workload S4, would finish roughly 7 times faster than Processor 1 running S1. The process of dividing and managing the allocation of work is a trivial task, with each process calculating the offset to its section of work.

Figure 3.4: Block cyclic schedule
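The size of the imbalance can be checked numerically. The sketch below sums the force calculations in each equal-sized block, assuming that body index i (1-based) requires N minus i calculations and that N divides evenly by the processor count; `block_work` is a hypothetical helper.

```c
#include <assert.h>

/* Force calculations contained in block p (1-based) when n bodies are
   split into `procs` equal, in-order blocks. Assumes n % procs == 0. */
long block_work(long n, int procs, int p)
{
    assert(n > 0 && procs > 0 && p >= 1 && p <= procs);
    assert(n % procs == 0);

    long size = n / procs;
    long first = (long)(p - 1) * size + 1;  /* first body index in block */
    long last = first + size - 1;           /* last body index in block */
    long total = 0;

    for (long i = first; i <= last; i++)
        total += n - i;
    return total;
}
```

With 1,000 bodies and 4 processors, the first block holds 218,625 calculations and the last holds 31,125, a ratio of approximately 7 to 1.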

Each process can easily calculate its own subtask within the given body set. The overhead of this computation will be low, and will require no inter-process communication.

Set process to current thread number
Set task size to N divided by total threads
Set start to task size multiplied by (process minus one)
Set end to start add task size
If N modulus total threads > 0
    If process <= N modulus total threads
        Set start = start add process minus 1
        Set end = end add process
    Else
        Set start = start add N modulus total threads
        Set end = end add N modulus total threads
    End If
End If

Figure 3.5: Pseudo code for in process offset calculation

When calculating work offset, it is important to take into account the remainder when dividing work between processes; not doing so will result in an incorrect work schedule (see Table 3.1).

Processor              1     2     3     4

Without modulus
  Start                0    25    50    75
  End                 24    49    74    99
  Work                25    25    25    25

With modulus
  Start                0    26    52    78
  End                 25    51    77   102
  Work                26    26    26    25

Table 3.1: Offset pattern for 100 elements
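The offset calculation can be sketched in C as follows. This is an illustration under stated assumptions rather than the project code: the remainder is taken to be N modulo the number of threads, process numbers are 1-based, `end` is exclusive, and `work_range` and `block_range` are hypothetical names.

```c
#include <assert.h>

typedef struct {
    long start;  /* first index owned by the process */
    long end;    /* one past the last index owned (exclusive) */
} work_range;

/* Range of indices for `process` (1-based) when n elements are shared
   between `threads` processes; the first (n % threads) processes each
   take one extra element. */
work_range block_range(long n, int threads, int process)
{
    assert(n >= 0 && threads > 0 && process >= 1 && process <= threads);

    long task = n / threads;
    long rem = n % threads;
    work_range r;

    r.start = task * (process - 1);
    r.end = r.start + task;
    if (rem > 0) {
        if (process <= rem) {
            r.start += process - 1;  /* shift past earlier extra elements */
            r.end += process;        /* shift, plus this process's extra */
        } else {
            r.start += rem;          /* shift past all extra elements */
            r.end += rem;
        }
    }
    return r;
}
```

With n = 100 and four threads, this yields the "without modulus" rows of Table 3.1. With n = 103, it yields starts of 0, 26, 52 and 78 and inclusive ends (end minus one) of 25, 51, 77 and 102, which match the "with modulus" rows.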

Cyclic scheduling would provide a fairer distribution of work, permitting each core to process each index as an offset from the first allocated; this would provide a theoretical best allocation. However, it is noteworthy that this scheme would cause cache coherency problems. The Nehalem has a 64-byte cache line, which holds a copy of a specific region of RAM. If required by another core, the line could then be evicted, which would consequently cause contention. If a cyclic schedule allocates Index 1, holding a double precision value, to process 1, and Index 2, holding a separate double that resides in the same cache line, to process 2, this will cause contention. The perfect schedule for cache coherent cores would be to size each block so that the index values of the block reside in a single cache line. Such a scheduling scheme has been developed by [26].
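A cyclic schedule with blocks sized to one cache line can be sketched as an index-to-owner mapping. The sketch is illustrative and is not the scheme of [26]: it assumes the array starts on a 64-byte boundary, uses 0-based indices and owners, and `owner_of_index` is a hypothetical name.

```c
#include <assert.h>
#include <stddef.h>

#define CACHE_LINE_BYTES 64  /* Nehalem cache line size, from the text */

/* Owner (0-based thread number) of element `index` under a cyclic
   schedule whose block size is one cache line. Assumes the array is
   aligned to a cache-line boundary, so that elements sharing a line
   always share an owner. */
int owner_of_index(size_t index, size_t elem_size, int threads)
{
    assert(threads > 0);
    assert(elem_size > 0 && elem_size <= CACHE_LINE_BYTES);

    size_t per_line = CACHE_LINE_BYTES / elem_size;  /* 8 for a double */
    return (int)((index / per_line) % (size_t)threads);
}
```

With 8-byte doubles, elements 0 to 7 belong to thread 0, elements 8 to 15 to thread 1, and so on, so that no two threads write to the same cache line.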

Figure 3.6: Cyclic Schedule

A seemingly good scheme for the N-body problem would be the block cyclic equal area schedule shown in Figure 3.7.

Figure 3.7: Equal area schedule

However, this may provide too perfect a scheme, as it would divide total computation equally between the available processors. As outlined in Chapter Two, memory access times across the processing elements are not uniform. Accordingly, equally dividing computation across the devices may ultimately lead to latent processing (cores waiting for data to process from RAM), which will result in staggered completion of the equal computational loads.
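An equal area division of the body indices can be sketched as follows. The sketch assumes that index i (1-based) costs N minus i force calculations and closes each block once its cumulative share of the total has been reached; `equal_area_end` is a hypothetical helper, and memory latency is deliberately ignored.

```c
#include <assert.h>

/* Last body index (1-based, inclusive) of block p (1-based) when the
   total force calculations are shared as evenly as possible between
   `procs` in-order blocks. Block p starts one past the end of block
   p - 1, and block 1 starts at index 1. */
long equal_area_end(long n, int procs, int p)
{
    assert(n > 0 && procs > 0 && p >= 1 && p <= procs);

    long long total = (long long)n * (n - 1) / 2;
    long long done = 0;

    if (p == procs)
        return n;  /* the final block always runs to the last body */
    for (long i = 1; i <= n; i++) {
        done += n - i;
        /* Block p ends once p/procs of the total work is covered. */
        if (done * procs >= total * p)
            return i;
    }
    return n;
}
```

For 1,000 bodies and 4 processors, the blocks end at indices 134, 293, 500 and 1,000: the first block holds far fewer bodies than the last, yet each holds roughly a quarter of the computation.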

3.2.3

Dynamic Scheduling

Dynamic scheduling allows processors to adapt their load balance to meet changing computational requirements or the changing state of the underlying hardware. One of the popular dynamic scheduling techniques, simple self-scheduling, establishes a master-child relationship. The master maintains a queue of work tasks, which are subsequently allocated to the child compute processes as they finish their current task. This type of dynamic scheduling works well when there are no mechanisms to determine the task size or dynamics. Usually, the master requires that access to the queue of tasks by the children be restricted by a lock. If the task size is small, then processors will spend a high proportion of time contending for the lock. If the task size is too great, there could then be a load imbalance. If we know the quantity of work ahead of time, we can create the optimum work distribution, which would provide a better balance between master lock contention and load imbalance. However, if so much is known about the task size, then we can forgo the dynamic schedule and accordingly implement a static schedule, which has a much smaller overhead.
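Simple self-scheduling with a lock-protected queue can be sketched with Pthreads. This is a minimal illustration, not the project implementation: the queue is reduced to a shared counter, the work itself is a placeholder, and all names (`task_queue`, `worker`, `run_self_schedule`, `MAX_WORKERS`) are hypothetical.

```c
#include <assert.h>
#include <pthread.h>

#define MAX_WORKERS 16  /* illustrative upper bound */

typedef struct {
    pthread_mutex_t lock;    /* guards `next` */
    long next;               /* first task not yet handed out */
    long total;              /* number of tasks in the queue */
    long chunk;              /* tasks handed out per request */
    long done[MAX_WORKERS];  /* tasks completed by each worker */
} task_queue;

typedef struct {
    task_queue *q;
    int id;
} worker_arg;

/* Each worker repeatedly locks the queue, claims the next chunk,
   unlocks, and then processes the chunk outside the lock. */
static void *worker(void *p)
{
    worker_arg *a = p;
    task_queue *q = a->q;

    for (;;) {
        pthread_mutex_lock(&q->lock);
        long first = q->next;
        long last = first + q->chunk;
        if (last > q->total)
            last = q->total;
        q->next = last;
        pthread_mutex_unlock(&q->lock);

        if (first >= last)
            break;  /* queue is empty */
        q->done[a->id] += last - first;  /* placeholder for real work */
    }
    return NULL;
}

/* Run `workers` threads over `total` tasks and return how many tasks
   were completed in total. */
long run_self_schedule(long total, long chunk, int workers)
{
    task_queue q = { .next = 0, .total = total, .chunk = chunk };
    pthread_t tid[MAX_WORKERS];
    worker_arg args[MAX_WORKERS];
    long sum = 0;

    assert(total >= 0 && chunk > 0);
    assert(workers > 0 && workers <= MAX_WORKERS);
    pthread_mutex_init(&q.lock, NULL);
    for (int i = 0; i < workers; i++) {
        args[i].q = &q;
        args[i].id = i;
        int rc = pthread_create(&tid[i], NULL, worker, &args[i]);
        assert(rc == 0);
        (void)rc;
    }
    for (int i = 0; i < workers; i++) {
        pthread_join(tid[i], NULL);
        sum += q.done[i];
    }
    pthread_mutex_destroy(&q.lock);
    return sum;
}
```

The `chunk` parameter exposes the trade-off described above: a small value increases the number of lock acquisitions, whereas a large value risks leaving some workers idle while others finish their final chunk.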

In [29], the authors describe the use of directed acyclic graph (DAG) composition over a static list, with tasks being connected by a graph which specifies the communication weight. This could be useful for scheduling the same computational element across different heterogeneous elements that have differing communication overheads. It has been shown in the previous chapter that the communication costs of the GPU will be high if there is insufficient computation on the GPU.

Figure 3.8: DAG approach[29]

The establishment of a DAG would incur overheads, and may be more suited to applications where the computation is unknown at the start and additional tasks are created during computation. This is not a characteristic of the N-body problem. In [30], the authors show that a work-stealing approach to the redistribution of DAG tasks within a thread group improves if stealing is determined by locality to the work which was originally on the processor. This technique could be applied to the block scheduling technique, thereby permitting block allocation and subsequent theft of work in order to ensure proper distribution. Theft through offsetting would ensure data locality.

In [29], the authors describe problems as either deterministic or non-deterministic: deterministic is defined as where precedence relations exist between the tasks and the relationship between the tasks is known in advance; non-deterministic describes tasks where information relating to execution characteristics is only known at runtime.

The N-body problem is deterministic in that we know the problem domain to be static, as well as the computational requirements. However, applying a deterministic problem to what is a non-deterministic receptor, namely the underlying heterogeneous architecture, would lead to treating the deterministic problem as non-deterministic.

3.2.4

OS Thread Scheduling

Applications can detect the number of hardware cores in a system and then use a scheduling algorithm to divide the computation between those cores. It is the responsibility of the operating system to assign those processes and the associated computational load to the individual hardware cores; even on a system which is dedicated to a single computational task, there will be contention for the individual cores. Sources of contention will primarily stem from background services and operating system functions. Accordingly, we cannot rely on any optimisation of thread placement by the underlying operating system. On a system which has a number of tasks running, we could find two computational threads placed on the same processor core; this would cause a serious imbalance, as that single core would be running double the intended computational load. Notably, static scheduling techniques would fail to address this imbalance, although a dynamic schedule could adapt to the changes.
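Detecting the number of cores can be sketched as follows. The sketch assumes a Linux-like system on which sysconf() accepts `_SC_NPROCESSORS_ONLN`, a widely available extension; the count it returns is the number of logical processors, so hyper-threaded cores are counted twice. The function names are hypothetical.

```c
#include <assert.h>
#include <unistd.h>

/* Number of logical processors currently online, or 1 if the system
   cannot report it. Relies on _SC_NPROCESSORS_ONLN being supported. */
int online_cores(void)
{
    long n = sysconf(_SC_NPROCESSORS_ONLN);
    return n > 0 ? (int)n : 1;
}

/* Number of worker threads to create: the requested count, capped at
   the number of logical processors; 0 means "use all of them". */
int worker_count(int requested)
{
    assert(requested >= 0);

    int cores = online_cores();
    return (requested == 0 || requested > cores) ? cores : requested;
}
```

Matching the thread count to the reported processors does not control where the operating system places each thread, which is the limitation discussed above.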

3.2.5

Profiling

Profiling can be characterised as design time or runtime. Design time profiling involves the analysis of the current application to determine the areas of the computation which are responsible for consuming the majority of the total compute time. Usually, an application is written in order to meet various requirements, and its output is validated against those requirements. Once functional, the application can then be further reengineered with performance in mind. The first step in the reengineering process is to profile the application to identify the areas which can be redeveloped to obtain an increase in performance. Often, this engineering process is targeted at a certain architecture type. This is an issue in modern computing environments, where many different heterogeneous architectures can be found. Runtime profiling can analyse the application as it executes on the target architecture and tune the algorithm for that hardware combination.
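The core of a runtime profiler is a timer wrapped around a candidate configuration, followed by a choice of the fastest. The sketch below assumes a POSIX system providing clock_gettime() with CLOCK_MONOTONIC; `time_call`, `fastest` and `busy_work` are hypothetical names, and `busy_work` merely stands in for a computational kernel.

```c
#define _POSIX_C_SOURCE 200809L  /* for clock_gettime on strict compilers */
#include <assert.h>
#include <time.h>

/* Wall-clock seconds taken by one call of fn(arg). */
double time_call(void (*fn)(void *), void *arg)
{
    struct timespec t0, t1;

    assert(fn != NULL);
    clock_gettime(CLOCK_MONOTONIC, &t0);
    fn(arg);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    return (double)(t1.tv_sec - t0.tv_sec)
         + (double)(t1.tv_nsec - t0.tv_nsec) / 1e9;
}

/* Index of the smallest timing among `count` measured configurations. */
int fastest(const double *seconds, int count)
{
    int best = 0;

    assert(seconds != NULL && count > 0);
    for (int i = 1; i < count; i++)
        if (seconds[i] < seconds[best])
            best = i;
    return best;
}

/* Stand-in workload: arg points to a long giving the iteration count. */
void busy_work(void *arg)
{
    long n = *(long *)arg;
    volatile double acc = 0.0;

    for (long i = 0; i < n; i++)
        acc += (double)i;
}
```

A profiler built on this would time each candidate schedule or device on a short run and then use the index returned by `fastest` for the remaining steps; a single measurement is noisy, so repeated runs would be advisable in practice.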

3.3

Chapter Summary

In this chapter, we have identified that, given a known task size as provided by the N-body problem, static scheduling would provide a suitable approach for dividing the work between the available elements. However, we have further determined that, given an environment with many different heterogeneous hardware types, static scheduling techniques alone would be unable to determine the best pattern of instantiation. Runtime profiling would enable us to perform tests of the static schedule in order to determine its effectiveness and to subsequently select its optimum instantiation pattern.

Chapter 4

Heterogeneous Development Methodologies and Patterns

4.1

Chapter Overview

This chapter will survey current development methodologies in an attempt to identify a work plan for the remainder of the project. It is essential that the stages of heterogeneous development be fully understood and formalised into a structured methodology. In order to assist in the design of a solution to the load-balancing and profiling problem, it is essential that the patterns in existence for heterogeneous development be reviewed. It is well understood in the development community that, where possible, patterns for common implementations should be followed. This chapter will identify a development methodology and a development pattern to assist the realisation of the application code.

4.2

Development Methodologies

Development methodologies have become ubiquitous in assisting system development teams to produce quality applications which meet the requirements of end users. These frameworks have been developed over time in response to failed projects. Moreover, the development of standard working practices in the development of software has led, in the majority of cases, to better and more reliable software. One such methodology, the waterfall model, breaks the development cycle into a number of phases (Figure 4.1). This approach is common: it splits the development of software into discrete phases and ensures that each phase is complete before stepping to the next. Commonality can also be found between the phases of different methodologies, showing that, in principle, at a high level, the same work is being carried out. Methodologies are followed for the development of new software or the redevelopment of old.

4.2.1

Software Development Phases

The first phase is concerned with capturing the requirements which users will have of the completed software. Usually, an application is developed with the objective of fulfilling a number of specific requirements, e.g. processing online payments. When complete, the application will be tested against these requirements so as to determine its fitness for purpose. HPC application development is usually concerned with the redevelopment of an algorithm or existing application for an HPC platform. The sole requirement in most cases is application speed-up; that is, the reduction in application runtime compared with a previous configuration.

For example, a video compression application which runs on a single-threaded processor might need to be redeveloped so as to make use of a multi-core device, with the core requirement of making the video compression faster. Originally, the application was developed to meet the requirements of the user, and was tested and deemed to have passed that requirement. Now, the application is to be re-engineered in such a way to run faster on a multi core device. There is also the possibility that the application has been developed to run on a multi-core device, but is now required to run on a GPU. Thus, we can typically define our main development requirements to be concerned with temporal improvements.

Figure 4.1: Waterfall Model[31]

Once requirements have been defined, we are subsequently concerned with the analysis of the problem and the design of the solution. Analysing the current performance of the application would require runtime profiling. Notably, there are a number of tools available for identifying the intensive sections of the application [32]. In this project, we will make use of the Gprof profiler. Gprof is a call graph execution profiler which reports the percentage of time spent in each function [33]. This tool provides us with a good understanding of the intensive sections of the application; we can then target those functions to be processed in parallel.

Before we start to think about designing an algorithm to support the parallel computation of the intensive section, we need to first determine if that section is a problem which can be run in parallel. In order to assess its suitability, we can analyse the algorithm and the mathematics underpinning it in order to determine whether there is independence in computation. Essentially, if there is a serial dependence, the likelihood of separating computation is low unless we are able to redesign the algorithm without losing fidelity.

Algorithms are usually developed by scientists in an iterative process. Segal, in [34], documents the results of over 10 years’ observations on the development of software by the scientific community. Notably, Segal studied earth and planetary scientists, and structural biologists. Segal notes that many who develop scientific software have little or no training in software engineering, and that the approach to development does not tend to follow any specific methodology and is more of an ad-hoc iterative process. Moreover, software is typically constructed by the scientist or by another scientist working in the same laboratory. Such scientists are commonly experts in their own field, and the algorithms are produced to their requirements and verified through valid output. However, it is not expected that the algorithms be written in a way that best utilises the computational resource. In [35], the authors state that the goal of a scientist is to produce new scientific knowledge; optimising software which runs within their assigned computational quota is not a requirement. Optimisation is only required when the desired fidelity is not met within their allotted resource. Therefore, when changing an algorithm, it is important that the mathematical approximation currently required is not changed.

4.2.2

Algorithm Design Considerations

Laxmikant V. Kalé, in [36], suggests a number of dictums learned through extensive work in the development of parallel applications. Dictum 1 recommends the over-decomposition of computation into migratable objects. Decoupling the computation from a single processor or number of processors eases the deployment of the algorithm. Kalé also suggests that using a runtime system to deploy the computation to the available resource, thereby decoupling the algorithm from the hardware, will enable this redeployment. Dictum 2 details the use of automated load-balancing via measurement. Separate chunks of computation can be allocated to individual processors or groups of processors. Following Dictum 1 will assist in the dynamic load-balancing of computational blocks required by Dictum 2. Dictum 3 states that using asymptotic and iso-efficiency analysis to redesign the algorithm will ultimately provide scalability, thereby resulting in speed-up for an increasing number of cores. Dictum 4 mandates the use of performance analysis tools; although this is aimed at many thousand core computers, previous research indicates that performance analysis is just as important for single core systems, so the dictum holds for both large and small systems. Dictum 5 details the use of system emulators to prepare and analyse the algorithm before real system application. Nvidia provides a system simulator; to compile an application to be simulated, issue the make command with the emu=1 flag. IBM provides a full system simulator which boots a virtual Linux image that has access to a virtualised CELL processor.

In [37], the authors suggest the following important stages in the development of parallel algorithms:

1. Subtask Decomposition
2. Dependence Analysis
3. Scheduling

This involves splitting the algorithm into subtasks of computation whilst finding and eradicating dependencies between the computational blocks. Often, access to dependent or shared data will result in a communication pattern and/or lock contention between processors. The authors in [37] further detail the importance of scheduling in order to ensure processor utilisation.

In order to assist HPC application development, a number of languages have been proposed to ease the development of multi-threaded applications across systems with multiple computational elements (see Table 4.1).

Technology  Organisation  Coprocessor support  Data parallel  Task parallel  Thread support  Language/Library  Current target architectures

Brook+  AMD  Y Y Y Y  Language (C)  AMD GPU
Chapel  Cray  Y Y Y  Language  Multicore CPU
Cilk++  Cilk  Y Y Y  Language (C++)  Multicore CPU (shared memory x86)
Co-array Fortran  Standards Body  Y Y  Language (Fortran)  Multicore CPU
CUDA  NVIDIA  Y Y Y Y  Language (C)  NVIDIA GPU
Fortress  Sun/Open source  Y Y Y  Language  Multicore CPU
OpenCL  Standards Body  Y Y Y Y  Language (C)  GPU
PVL  MIT-LL  Y Y  Library (C++)  Multicore CPU
PVTOL  MIT-LL  Y Y Y Y  Library (C++)  Multicore CPU, Cell, GPU, FPGA
Sequoia  Stanford  Y Y  Language (C)  Cell
StreamIt  MIT  Y Y Y  Language  Multicore CPU
Titanium  UC Berkeley  Y Y  Language (Java)  Multicore CPU
UPC  Standards Body  Y Y  Language (C)  Multicore CPU
VSIPL++  Standards Body  Y Y  Library (C++)  Multicore CPU, NVIDIA GPU, Cell, FPGA
X10  IBM  Y Y Y  Language (Java)  Multicore CPU

Table 4.1: Parallel languages[38]

Additional languages are being developed to address the failings of general-purpose programming languages when used for parallel development. These failings arise because standard languages were not designed for the changing architectures. An important step in the design or redevelopment of an algorithm is to determine whether the language in which the algorithm is currently written is suitable for the target architecture. It may be worthwhile to redevelop the algorithm in a parallel language. Moreover, it is important to ensure that the lead time for development is considered, and that all target architectures are compatible with the parallel language chosen. Many languages (e.g. Cilk++) have been developed with a single target architecture in mind.


4.3 Patterns

Patterns have been used for a number of years in software engineering to add rigour to the software engineering process. Riehle & Züllighoven define a pattern as ‘the abstraction from a concrete form which keeps recurring in specific non-arbitrary contexts’ [39].

Patterns are abstract concepts or frameworks for tackling complex algorithmic problems. When developing an application, it can be useful to use a pattern that is proven to solve a problem. Patterns are abstract rather than concrete, and so must be applied to the problem in order to create a concrete algorithm. Figure 4.2 shows an example pattern for the message passing action commonly found in parallel application development.

Figure 4.2: Message passing pattern[40]

As we can see, the pattern in Figure 4.2 contains many important elements which, although abstract, would assist in the development of an algorithm to support message passing. Figure 4.3 shows one of the most common patterns, divide and conquer. In this pattern, we produce individual sections of computation and then merge the results of each individual piece of computation. As outlined in the previous chapter, load-balancing would be required in order to allocate computation to the individual processing elements.
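The divide and merge steps of this pattern can be sketched in C as follows. This is a serial illustration only, assuming a simple array reduction as the computation; in a parallel implementation each block would be handed to a separate processing element. The function names `block_range`, `partial_sum` and `divide_and_merge_sum` are hypothetical and do not come from the thesis code.

```c
#include <assert.h>
#include <stddef.h>

/* Divide: compute the half-open range [begin, end) of block b when n
 * elements are split into n_blocks near-equal blocks. The first
 * (n % n_blocks) blocks each receive one extra element. */
void block_range(size_t n, size_t n_blocks, size_t b,
                 size_t *begin, size_t *end)
{
    assert(n_blocks > 0 && b < n_blocks);

    size_t base = n / n_blocks;
    size_t extra = n % n_blocks;

    *begin = b * base + (b < extra ? b : extra);
    *end = *begin + base + (b < extra ? 1 : 0);
}

/* Individual section of computation: sum one block of the input. */
double partial_sum(const double *data, size_t begin, size_t end)
{
    double sum = 0.0;
    for (size_t i = begin; i < end; ++i)
        sum += data[i];
    return sum;
}

/* Merge: combine the partial result of every block into one total.
 * The loop body is what each processing element would execute. */
double divide_and_merge_sum(const double *data, size_t n, size_t n_blocks)
{
    double total = 0.0;
    for (size_t b = 0; b < n_blocks; ++b) {
        size_t begin, end;
        block_range(n, n_blocks, b, &begin, &end);
        total += partial_sum(data, begin, end);
    }
    return total;
}
```

The near-equal block sizes used here are only appropriate when every element costs the same to process; where the cost varies, as in the triangular N-body interaction pattern, the division step is where a load-balancing schedule would be applied.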

In [41], the authors detail various mechanisms to implement the divide and conquer algorithm:

Fork join: The main serial thread of the application must create (Fork) individual threads on each processor core. The main thread must then wait for e
