Figure 4.1: Single-Core CPU architecture, against which many applications programmers optimise their software.
For many years CPUs have contained a single processing core with cached access to the system memory. This common design is summarised in Figure 4.1. This type of CPU design is easy to write code for as there is only a single core accessing, changing and writing values and one cache
hierarchy. Multi-core CPUs generally have some combination of per-core and shared memory caches, all connected to the shared system memory [96, 98]. Figure 4.2 shows an example design of a quad-core CPU. Each CPU core has its own L1 cache, with an L2 cache shared between all four cores.
Figure 4.2: Multi-Core CPU architecture, four cores with their own L1 caches and a shared L2 cache.
One of the major challenges that multi-core chip designers must overcome is cache coherency between cores [98, 108]. Modern CPUs rely on several levels of cache to keep the cores supplied with the necessary data. Some of these cache levels are shared between the cores of a CPU and some are private to a single core. When multiple cores are manipulating the same data, these different levels of cache must be kept coherent. This becomes increasingly difficult as the number of cores in the CPU is increased: more cores mean more caches that must be kept coherent, which has a negative impact on the overall performance of the CPU.
With CPU design adopting multi-core architectures on a large scale [94, 96, 101], multi-threaded programming is becoming increasingly important. Developers can no longer rely on faster cores to allow for more complex applications and must instead design their programs to make use of multiple cores. Parallel programming will have to move away from being a topic of advanced courses taken by experts and become a more commonly used technique [94].
There are many languages and libraries available that allow multi-threaded code to be written in a variety of different ways. In this thesis two libraries are discussed: POSIX threads [110], a C interface, and Threading Building Blocks [111], a C++ library.
4.2.1 POSIX threads
POSIX Threads, or Pthreads, is a low-level library for using system threads [110]. The library allows for the explicit creation of threads and inter-thread communication. This low-level method of programming threads can often provide the best performance; however, it can require a great deal of parallel programming knowledge and is also the most error-prone.
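To make the explicit creation of threads concrete, a minimal sketch is given here. It is an illustration only and not code from this thesis: the names sum_task, sum_range and parallel_sum, and the limit MAX_THREADS, are hypothetical. Each thread sums its own sub-range and writes only to its own result slot, so no further synchronisation is needed beyond joining the threads.

```c
#include <assert.h>
#include <pthread.h>

#define MAX_THREADS 16  /* arbitrary limit chosen for this sketch */

/* Work description for one thread (hypothetical type). */
typedef struct {
    long begin;   /* first integer to add (inclusive) */
    long end;     /* one past the last integer to add */
    long result;  /* partial sum, written only by the owning thread */
} sum_task;

/* Thread entry point: sums the integers in [begin, end). */
static void *sum_range(void *arg)
{
    sum_task *task = arg;
    long total = 0;
    for (long i = task->begin; i < task->end; ++i)
        total += i;
    task->result = total;
    return NULL;
}

/* Sums 0..n-1 using num_threads explicitly created threads.
   Returns -1 if a thread could not be created. */
long parallel_sum(long n, int num_threads)
{
    assert(num_threads > 0 && num_threads <= MAX_THREADS);

    pthread_t threads[MAX_THREADS];
    sum_task tasks[MAX_THREADS];
    long chunk = n / num_threads;
    int started = 0;
    int failed = 0;

    for (int t = 0; t < num_threads; ++t) {
        tasks[t].begin = t * chunk;
        /* The last thread also takes any remainder. */
        tasks[t].end = (t == num_threads - 1) ? n : (t + 1) * chunk;
        tasks[t].result = 0;
        if (pthread_create(&threads[t], NULL, sum_range, &tasks[t]) != 0) {
            failed = 1;
            break;
        }
        ++started;
    }

    /* Joining waits for each thread to finish before its result is read. */
    long total = 0;
    for (int t = 0; t < started; ++t) {
        pthread_join(threads[t], NULL);
        total += tasks[t].result;
    }
    return failed ? -1 : total;
}
```

With most compilers this would be built with the -pthread flag. Even in this small case the developer has to divide the work, create the threads and collect the results by hand.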
When implementing a program using Pthreads, the main consideration is ensuring that the threads co-operate to perform the computation correctly. As threads must often perform computation using the same data, they must be careful not to interfere with each other. With the Pthreads library, this coordination must be explicitly programmed by the developer. The two main
constructs used for this are mutexes and semaphores.
Mutexes allow threads to ‘lock’ data. When data is accessed by multiple threads, a mutex can be used to stop any other threads accessing and modifying that data out of order. Threads can lock a mutex, modify the data and then unlock the mutex again. If a thread tries to lock a mutex while it is already locked by another thread, it will wait until the other thread has unlocked the mutex before it can continue. Through the correct use of mutexes, it can be guaranteed that only one thread will be accessing that piece of data at any time.
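The lock, modify, unlock pattern described above can be sketched as follows. This is a minimal illustration under stated assumptions, not code from this thesis: shared_counter, add_to_counter, locked_count and MAX_THREADS are hypothetical names. Several threads increment one shared value and the mutex ensures that only one of them modifies it at a time.

```c
#include <assert.h>
#include <pthread.h>

#define MAX_THREADS 16  /* arbitrary limit chosen for this sketch */

/* Shared data together with the mutex that protects it (hypothetical type). */
typedef struct {
    pthread_mutex_t lock;
    long value;       /* modified only while the mutex is held */
    long increments;  /* read-only: how many times each thread adds 1 */
} shared_counter;

/* Thread entry point: adds 1 to the shared value, one locked step at a time. */
static void *add_to_counter(void *arg)
{
    shared_counter *counter = arg;
    for (long i = 0; i < counter->increments; ++i) {
        pthread_mutex_lock(&counter->lock);    /* waits if another thread holds the lock */
        counter->value += 1;                   /* at most one thread executes this line */
        pthread_mutex_unlock(&counter->lock);  /* lets a waiting thread continue */
    }
    return NULL;
}

/* Runs num_threads threads that each increment the counter `increments`
   times and returns the final value. */
long locked_count(int num_threads, long increments)
{
    assert(num_threads > 0 && num_threads <= MAX_THREADS);

    shared_counter counter = { .value = 0, .increments = increments };
    pthread_mutex_init(&counter.lock, NULL);

    pthread_t threads[MAX_THREADS];
    int started = 0;
    for (int t = 0; t < num_threads; ++t) {
        if (pthread_create(&threads[t], NULL, add_to_counter, &counter) != 0)
            break;
        ++started;
    }
    for (int t = 0; t < started; ++t)
        pthread_join(threads[t], NULL);

    pthread_mutex_destroy(&counter.lock);
    return counter.value;
}
```

If the lock and unlock calls were removed, the increments from different threads could interleave and some updates would be lost, which is the kind of race condition the mutex is there to prevent.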
Semaphores allow threads to communicate and synchronise with each other. A thread that waits on a semaphore may halt until another thread posts to that semaphore, allowing it to continue. Each semaphore has a counter: a post increments the counter while a wait decrements it. If the counter is positive, a wait will decrement it and continue immediately. If it is 0, the thread will halt until the semaphore receives a post.
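The post and wait behaviour can be illustrated with a short sketch in which one thread produces values and another consumes them. It is an illustration only: handoff, producer, consume_all and MAX_ITEMS are hypothetical names, and the sketch assumes a platform that supports unnamed POSIX semaphores created with sem_init (Linux, for example).

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <semaphore.h>

#define MAX_ITEMS 8  /* arbitrary limit chosen for this sketch */

/* Data passed from the producer thread to the consumer (hypothetical type). */
typedef struct {
    sem_t ready;           /* counter = items produced but not yet consumed */
    int items;             /* number of posts the producer will make */
    long data[MAX_ITEMS];
} handoff;

/* Thread entry point: writes each item, then posts to announce it. */
static void *producer(void *arg)
{
    handoff *h = arg;
    for (int i = 0; i < h->items; ++i) {
        h->data[i] = (long)(i + 1) * 10;
        sem_post(&h->ready);  /* increments the counter */
    }
    return NULL;
}

/* Consumes `items` values from a producer thread and returns their sum,
   or -1 if the semaphore or thread could not be created. */
long consume_all(int items)
{
    assert(items > 0 && items <= MAX_ITEMS);

    handoff h = { .items = items };
    if (sem_init(&h.ready, 0, 0) != 0)  /* counter starts at 0 */
        return -1;

    pthread_t thread;
    if (pthread_create(&thread, NULL, producer, &h) != 0) {
        sem_destroy(&h.ready);
        return -1;
    }

    long total = 0;
    for (int i = 0; i < items; ++i) {
        /* Halts while the counter is 0; retried if interrupted by a signal. */
        while (sem_wait(&h.ready) == -1 && errno == EINTR)
            continue;
        total += h.data[i];  /* item i was written before the i-th post */
    }

    pthread_join(thread, NULL);
    sem_destroy(&h.ready);
    return total;
}
```

Because the counter starts at 0, the consumer cannot read an item before the producer has posted it. If the producer runs ahead, the posts accumulate in the counter and the later waits return immediately.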
These low-level methods of Pthreads are very fast and closely tied to the functionality of the CPU. Unfortunately, their correct use is hard to learn and race conditions can easily be introduced into a program. Another approach to multi-threaded programming is to use a higher-level library such as Threading Building Blocks.
4.2.2 Threading Building Blocks
Intel’s Threading Building Blocks (TBB) provides a high-level way of writing a multi-threaded application. Rather than managing threads explicitly (although this is possible), TBB provides a library of commonly used parallel constructs and functions [111]. This still requires the developer to have enough knowledge of parallel programming to identify how the program can be safely parallelised, but it makes the implementation far less complex and error-prone.
TBB provides parallel algorithms that can be invoked without the need for manually managing thread synchronisation and data access. One such algorithm is parallel_for, which will iterate in parallel over some range (specified at run-time) and perform some computation as defined by the user. Other algorithms such as parallel_sort will use the available cores on the CPU to sort a list of items. This style of multi-threaded programming is more limited but is much easier to program and learn.
TBB also provides more low-level constructs such as mutexes, task groups and threads [111]. Thus TBB can be used for more complex low-level problems, but the programmer must address the issues of data access and synchronisation manually, as with the Pthreads library.
TBB contains many different functions and constructs, which allow a wide range of parallel programs to be implemented with it. The full functionality of TBB is presented in [111].