Explicitly Parallel Architectures

2.3 Multi-cores

Once the power and verification walls limited the complexity of individual cores, major microprocessor manufacturers shifted their focus to integrating multiple smaller, power-efficient cores on the same die, considering it the only way to take advantage of the ever-increasing number of transistors available on chip and to sustain performance scaling.

Frequency scaling has been replaced by core-count scaling, with the number of processors, or cores, doubling at each technology generation. The incremental path towards multi-core chips (two, four, or eight cores) has forced the software industry to expose parallelism to application developers explicitly. Applications developed for multi-cores aim at maximizing performance by distributing their load over multiple units; legacy code can still run on a single unit, even if its run-time performance is far from optimal.

OS-managed load balancing, in fact, becomes increasingly inefficient as the number of cores scales, because task-level granularity is in many cases too coarse to use resources efficiently; consequently, efforts have been undertaken to expose parallelism management to application programmers. Two different approaches have been pursued towards this goal: one based on explicit APIs, the other relying on code annotation.

MPI (HP Corp. [2007]) is an example of the first strategy; it defines library functions to support the development of parallel code. It is highly successful in the high-end technical computing community, where clusters of multi-core processors are common, but requires considerable effort in deriving efficient implementations when applications must be ported to multiple architectures. Even when the target architecture is fixed, application parallelization is far from trivial, as care must be taken to manage communication and access to shared resources to avoid concurrent programming errors like deadlocks.

Figure 2.3. A typical CPU and GPU block scheme: the GPU devotes more transistors to data processing (from NVIDIA Corp. [2010b]).

The goal of OpenMP (Chapman et al. [2008]) is to assist programmers in the parallelization process by allowing incremental code transformations, starting from a sequential version. It relies on pragma annotations to state which sections should be executed in parallel, which data should be private, and how execution should be synchronized.
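As a minimal sketch of this incremental style, the function below is ordinary sequential C to which a single OpenMP pragma has been added. The function name `dot` and its signature are illustrative assumptions, not taken from the cited references.

```c
#include <stddef.h>

/* Dot product of two vectors of length n (hypothetical example function).
 * The loop body is plain sequential C. The pragma asks an OpenMP-enabled
 * compiler to split the iterations across threads; the reduction clause
 * gives each thread a private partial sum and combines the partial sums
 * when the loop ends. A compiler without OpenMP support ignores the
 * pragma and runs the loop sequentially. */
double dot(const double *a, const double *b, size_t n)
{
    double sum = 0.0;
    long i; /* signed index, accepted by all OpenMP versions */

    #pragma omp parallel for reduction(+:sum)
    for (i = 0; i < (long)n; i++) {
        sum += a[i] * b[i];
    }
    return sum;
}
```

Removing the pragma gives back the original sequential version, which is what makes the transformation incremental. With GCC or Clang the parallel build is typically enabled with the `-fopenmp` flag.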

2.4 GPGPUs

Multi-cores break the sequential paradigm by exposing explicit parallelism to software, but retain fully featured processing elements able to execute general-purpose code; the approach maintains compatibility with serial applications and does not restrict their generality with respect to single-core microprocessors. In multi-cores, the number of cores is restricted by their complexity: the large amount of control logic makes it impossible to lower the transistor count per core below a (rather high) threshold, which in turn limits multi-core performance when executing applications that expose massive parallelism. Integrating hundreds or thousands of cores on a single die is possible only if dedicated architectures executing special-purpose applications are considered; to distinguish them from multi-cores, this family of ICs is termed many-cores.

Many-core architectures use much simpler, lower-frequency cores than the fully-featured processors used in multi-cores, resulting in more substantial power and performance benefits.

Moreover, the presence of many simple computing units, lacking complex control and caching mechanisms, makes it possible to devote a larger portion of resulting ICs to actual computation than is feasible with multi-cores (Figure 2.3).

In many-cores, hundreds to thousands of computational threads per chip are possible, each executing on a small data set in parallel and allocated to individual cores with little inter-core communication. Long-latency loads and stores to main memory can then be masked by shuffling thread execution on the computing units, instead of relying on deep cache hierarchies. Ultimately, these architectures enable an exponential growth in explicit parallelism, and their performance increase is greatly outpacing that of general-purpose microprocessors, as shown in Figure 2.4.

Figure 2.4. Floating-Point Operations per Second for the CPU and GPU (data from NVIDIA Corp. [2010b]).

Obtaining maximum speedup from an application mapped on a many-core platform requires application developers to know the underlying architecture. In particular, problems must be properly parallelized, and care must be taken to adapt them to the many-cores' tricky memory structure, which is explicitly exposed to the programmer and is more akin to software-managed scratchpad memories than to traditional caches.
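The difference from a hardware-managed cache can be sketched in plain C. In the hypothetical example below, the local buffer `scratch` stands in for a small on-chip memory, and the program itself decides when data is copied in and written back. The names `scale_tiled` and `TILE`, and the tile size, are assumptions made for illustration only.

```c
#include <stddef.h>
#include <string.h>

#define TILE 64 /* hypothetical scratchpad capacity, in elements */

/* Scales every element of data by factor, one tile at a time.
 * Each tile is explicitly loaded into the local buffer, processed
 * there, and explicitly stored back to the large array. */
void scale_tiled(float *data, size_t n, float factor)
{
    float scratch[TILE];
    size_t start;

    for (start = 0; start < n; start += TILE) {
        size_t len = (n - start < TILE) ? (n - start) : TILE;
        size_t i;

        memcpy(scratch, data + start, len * sizeof(float)); /* explicit load */
        for (i = 0; i < len; i++) {
            scratch[i] *= factor;                           /* local compute */
        }
        memcpy(data + start, scratch, len * sizeof(float)); /* explicit store */
    }
}
```

On a real many-core device the staging buffer would live in a dedicated on-chip memory and the copies would be issued cooperatively by many threads; the sketch only shows that data movement is part of the program rather than something the hardware does transparently.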

The industrial incarnation of the many-core paradigm is the General Purpose Graphics Processing Unit (GPGPU). GPGPUs evolved from graphics processors over the last decade, when GPU manufacturers replaced the fixed custom pipelines of previous-generation GPUs with a mesh of more general-purpose processing cores. Whereas traditional GPUs specialized only in drawing image data to the screen, modern GPGPU cores can be programmed using a variant of C.

The most popular software environments for GPGPU programming are C for CUDA (NVIDIA Corp. [2010a], a proprietary framework developed by NVIDIA) and OpenCL (NVIDIA Corp. [2010b], a collaborative effort managed by the Khronos Group). Both frameworks enable application developers to directly interface with many-core hardware and execute parallel applications on it.

GPGPUs are well suited to problems with high data parallelism and high arithmetic intensity, that is, a high ratio of arithmetic operations to control and memory operations.
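A rough worked example may help make the ratio concrete. The sketch below compares two textbook kernels under simplifying assumptions: control operations are ignored, and the matrix product is assumed to move each matrix element to or from memory only once (ideal on-chip reuse). All function names are hypothetical.

```c
/* Arithmetic intensity as a plain ratio of operation counts. */
double arithmetic_intensity(double arithmetic_ops, double other_ops)
{
    return arithmetic_ops / other_ops;
}

/* SAXPY, y[i] = a * x[i] + y[i]: per element, 2 arithmetic operations
 * (multiply, add) against 3 memory operations (load x, load y, store y). */
double saxpy_intensity(void)
{
    return arithmetic_intensity(2.0, 3.0);
}

/* n x n matrix product: 2*n^3 arithmetic operations (n^3 multiplies and
 * n^3 adds) against 3*n^2 memory operations under the ideal-reuse
 * assumption stated above (read two matrices, write one). */
double matmul_intensity(double n)
{
    return arithmetic_intensity(2.0 * n * n * n, 3.0 * n * n);
}
```

Under these assumptions SAXPY stays at 2/3 regardless of vector length, while the matrix product reaches 2n/3, so its intensity grows with problem size. The second kind of workload is the one that benefits most from the abundance of arithmetic units.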

Clusters of cores in a GPGPU execute in a SIMD (Single Instruction Multiple Data) fashion, minimizing the required control-flow logic. Many applications that process large data sets can use a data-parallel programming model to speed up computations: in traditional GPU applications, like 2D and 3D image processing, large sets of pixels and vertices are mapped to parallel threads; the added flexibility of GPGPUs has made it possible to parallelize many algorithms outside the field of image rendering and processing.
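The mapping of pixels to threads can be sketched in plain C as follows. This is an emulation under stated assumptions, not CUDA or OpenCL code: the function `brighten_kernel` describes the work of one logical thread, and the sequential loop in `brighten` stands in for the hardware that would run those threads in SIMD groups. Both names are hypothetical.

```c
#include <stddef.h>

/* Kernel: the work of ONE logical thread, identified by its index.
 * Every thread executes the same instructions on a different pixel. */
static void brighten_kernel(size_t thread_id, unsigned char *pixels,
                            size_t n, unsigned char delta)
{
    if (thread_id < n) { /* guard for threads beyond the data set */
        unsigned int v = (unsigned int)pixels[thread_id] + delta;
        pixels[thread_id] = (unsigned char)(v > 255u ? 255u : v); /* saturate */
    }
}

/* Launcher: emulates the device by visiting every thread index in turn.
 * The iterations are independent of each other, so any execution order,
 * including a fully parallel one, produces the same image. */
void brighten(unsigned char *pixels, size_t n, unsigned char delta)
{
    size_t tid;

    for (tid = 0; tid < n; tid++) {
        brighten_kernel(tid, pixels, n, delta);
    }
}
```

Because all threads follow the same control path apart from the bounds check, a group of them can share one instruction stream, which is what keeps the control logic per core small.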

Figure 2.5. Field Programmable Gate Array Evolution (from Xilinx Corp.).

The increasing interest in porting non-graphics applications to GPGPUs is a consequence of the growing gap in peak performance between microprocessors and graphics units. Even if extracting maximum performance from GPGPU hardware requires considerable effort, encouraging results have been reported in accelerating scientific workloads, like molecular dynamics (Elsen et al. [2006]) and Monte Carlo simulations (Preis et al. [2009]).

In document Architectural exploration and scheduling methods for coarse grained reconfigurable arrays (Page 31-34)