2.3 Key Runtime Aspects in a Semi-Explicit Model
2.3.3 Data Locality
Data locality refers to the use of data elements within relatively close storage locations, i.e. the arrangement of related tasks and their data close to each other as much as possible. This can improve parallel program execution time signifi- cantly by reducing the fetching time of remote data between threads. This can be achieved by executing tasks close to the data they need. A well-known fact to all parallel implementers is that the communication overhead can dramatically affect the performance of parallel computations [100]. This problem arises from small computations demanding remote data, causing a communication overhead much larger than the computation. Thread allocation and thread placement decisions are even harder on heterogeneous multicore architectures. Many parallel pro- gramming models use sophisticated adaptive algorithms that support dynamic,
lightweight threads. The heart of these models is a thread scheduler that bal- ances the load among the processes. In particular, the processor can execute another thread while one thread is waiting for communication. However, not only a good load balance is essential for high performance, good data locality is an important factor too. One technique is to allow the programmer to specify the data access patterns in the program. This can be quite difficult for complex data access patterns. The most popular load balance algorithm is work stealing, which is based on random scheduling of threads. However, the randomisation in the work-stealing algorithm can work against data locality. The current mod- els implementing work stealing still need to be improved to achieve good data locality.
X10 improves data locality by integrating new constructs (notably, places, regions and distributions) to model hierarchical parallelism and nonuniform data access [30]. X10 splits the memory space into parts known as partitioned global address space. Constructs enable programmers to assign a single place to each global address space. During execution of an activity, it will be located on the same partitioned global address of its place. With this design, X10 claims to minimise the communication between remote nodes.
Unified Parallel C (UPC) is a parallel extension of the C programming language developed for multiprocessors[37]. It is a distributed shared memory programming model. UPC facilities shared-memory programs, while exploiting
data locality. UPC provides constructs for placing data close to thread need it. These constructs provide a mechanism that a private object can only be accessed by its owner thread; any synchronisation with other threads can be made through a shared address space.
Scalable Locality-aware Adaptive Work-stealing Scheduler (SLAW) is a scheduler designed for parallel programming models [43]. The scheduler is currently implemented in the Habanero Java (HJ) language. The goal is to allow programmers or compilers to give locality hints to runtime. Like the X10 model, SLAW achieves locality by grouping workers into places. Each place has a mailbox to receive a remote task from a worker in another place. Each worker within a place has its own queue. Tasks in the mailbox have less priority than tasks in the worker queue. If a worker becomes idle, it looks for tasks in its local queue, then in other worker’s queue in the same place, and finally from the mailbox.
2.4
Summary
A parallel model that can exploit heterogeneous multicore architecture must deal with a parallel heterogeneous programming environment, scheduling on heteroge- neous processors, and with the management of nonuniform memory. Earlier work has focused on spawning threads dynamically using annotations, without taking the distance notion into account while spawning a thread. The distance notion means the communication overhead associated with a spawning thread on remote
processor element. It can be represented by the latency between processors or by the number of communication hubs traversed between PEs as we did in this thesis.
We have shown that existing mechanisms have limited support for improving data locality, which can have a high impact on parallel programming performance. In the next chapter, we will present the first programming and performance com- parison of functional multicore technologies and report some of the multicore results for two languages. The results of the comparison is one of the motivations to introduce new annotations in GpH to improve data locality.
Multicore Parallel Haskell
Comparison
This chapter compares four different parallel Haskell implementations on a mul- ticore architecture. It outlines the key features of each implementation. Finally, we present the findings of the comparison.
3.1
Introduction
There is a long-held assertion that functional languages are suited for parallel computation. The fundamental motivation of this claim is that, because of the lack of side effects, there will often be a chance to evaluate sub-expressions in parallel without risk of interference [52]. Moreover, parallelism can be achieved using high order functions, thus controlling the parallelism and at the same time
hiding the low-level coordination details. The low-level details are managed by the compiler or dynamically by the runtime system.
High level programming languages such as MPI [59], UPC, OpenMP [28] and Pthreads may be implemented on multicore architectures but this is usually not the best alternative for the migration of most mainstream applications [27]. MPI, UPC, and Pthreads require a high amount of code to be reorganised in order to achieve reasonable performance while OpenMP programs require less code reor- ganisation. However, OpenMP does not yet provide features for the expression of locality and modularity that may be needed for multicore applications.
While there are likely to be several successful approaches to multicore pro- gramming, we believe that functional language is relatively convenient for multi- core. Parallel Haskell languages have been successfully deployed on shared mem- ory systems (SMPs) and distributed memory architectures (DSMs).
This chapter presents a programming and performance comparison of func- tional multicore technologies and reports some of the first multicore results for the approaches. The comparison contrasts the programming effort required to specify coordination with the parallel performance delivered in each language. The comparison uses 15 programs carefully selected, i.e. without regard for their inherent parallelism, from representative parts of the Nofib benchmark suite [89]. In consequence, the results reflect the multicore performance that might be ex- pected for a “typical” set of Haskell programs (Section 3.4).
Feedback Directed Implicit Parallelism (FDIP) [46], and three “low pain”, i.e. semi-explicit languages (Section 2.2.4). The semi-explicit Haskells are Eden [74] and two implementations of Glasgow parallel Haskell (GpH) [121], GpH-SMP and GpH-GUM (Section 3.3).
Although the parallel Haskell implementations all share the same optimising Glasgow Haskell Compiler technology (GHC), each uses a different version, and hence the performance comparisons are based on speedups, which are normalised against different sequential performance. We establish a baseline for the speedup comparisons by reporting sequential and parallel runtimes and efficiencies for three of the languages (Section 3.5).
We report detailed parallel performance and programming effort studies, fo- cusing on the number of programs improved, speedups delivered, and program changes required to coordinate parallel evaluation (Section 3.6). The study com- pares the scalability, the programming effort required, and the parallel perfor- mance achieved, in each language (Section 3.7). We conclude by summarising the key results and discussing their implications (Section 3.8).
For the GpH-GUM measurements presented in this chapter, we use an earlier GHC-4.06 version which was upgraded later as a part of work conducted for this thesis and is used to evaluate the architecture-aware constructs in chapter 6. Moreover, it uses the original strategies style [122] to parallelise the benchmark using GpH-GUM. The strategies have been improved in [77]. The next chapter discusses the new strategies in more detail.