MULTICORE PROCESSORS AND SYSTEMS: A SURVEY

by

DM Rasanjalee Himali Ruku Roychowdhury

A Survey Submitted in Partial Fulfillment of the Requirements of Advanced Computer Architecture CSC 8210

Abstract

A multicore architecture is an integrated circuit that contains two or more individual processors, called cores. Implementations of multicore processors are numerous and diverse. In recent years, multicores have shown a significant performance advantage as well as improved power consumption. To benefit from multiple cores, software must contain components that can be parallelized to run on several cores simultaneously. The parallelization of software is a significant ongoing topic of research. As multicores become increasingly capable of executing many threads in parallel to achieve high speedups, programmers need to design code for execution by thousands of processes or threads. One of the major considerations here is how to write programs that can scale to hundreds of thousands of threads. A vast amount of research has been conducted in this area, and as a consequence several popular, commercially available programming languages and platforms have been developed in the recent past. Resource management is another active topic in the research community. Most of the academic work in this area has focused on modeling contention for last-level caches (LLCs), as this was believed to have the greatest effect on performance. Measuring multicore performance requires new ways of benchmarking. Once such new benchmarking platforms are devised, new methods are needed for interpreting their results. Many traditional benchmarks developed for multiprocessors are equally applicable to multicores. Researchers have also proposed numerous new benchmarks targeted at the multicore community.

In this survey, we cover the research conducted in the above-mentioned areas, along with current and future commercial multicore designs. Our main focus is to present and discuss research in the following areas: architectures of multicores, resource management of multicores, parallelization and programming in multicores, measuring the performance of multicore platforms, and future challenges in multicores. We also discuss commercially available tools and platforms for multicore programming and testing.

TOPIC INDEX

1. Introduction

1.1 Multicore Architectures

1.2 The Need for Multicores

1.3 Multicore Architecture Classification

1.3.1 Classification based on Application Class

1.3.2 Classification based on Memory System

1.4 Popular Multicore Architectures

1.5 On-Chip Interconnections

1.6 Future of Multicore

2. The Conceptual Architecture of Multicores

2.1 Motivation

2.2 Model of Computation

2.2.1 Process Based Models

2.2.1.1 Process Network

2.2.1.2 Synchronous Data Flow

2.2.1.3 Process Calculi

2.2.2 State Based Model

2.2.2.1 Finite state machine

2.2.2.2 Hierarchical and Concurrent Finite State Machine

2.2.2.3 Program State Machine

3. Resource Management of Multicore Systems

3.1 Shared Memory Contention in Multicores

3.2 Shared Memory Management Strategies

3.2.1 Cache Partitioning Strategies

3.2.2 Contention Aware Scheduling Strategies

3.3 Power Management in Multicores

4. Parallelization with Multicores

4.1 Background

4.2 Design spectrum of parallelization

4.3 Types of Parallelism

4.3.1 Task Parallelism

4.3.2 Data Parallelism

4.3.3 Pipelining

4.3.4 Structured grid

4.4 Multicore Programming Platforms

5. Measuring Multicore Performance

5.1 Traditional Benchmarking Methods

5.2 Multicore Benchmark Criteria

5.3 SMP Based Multicore Benchmarks

6. Conclusion and Future Challenges of Multicores

6.1 Conclusion

6.2 Future Challenges of Multicores

6.2.1 Software Challenge

6.2.2 Programmer’s Challenge

6.2.3 Hardware Challenge

1. INTRODUCTION

As personal computers have become more prevalent and more applications have been designed for them, end users have needed faster, more capable systems to keep up. Speedup has been achieved by increasing clock speeds and, more recently, by adding multiple processing cores to the same chip; such chips are called multicores. In this chapter we give a thorough introduction to multicores, their architectures, application areas and associated challenges.

1.1. Multicore Architectures

Multicore architectures, compared to traditional single core architectures, replicate multiple processors in a single die. Multicore processors are Multiple Instruction Multiple Data (MIMD) architectures in that different cores execute different threads on different parts of the memory. Multicores are generally shared memory architectures. The L1 caches are usually private to cores. L2 caches are private in some architectures and shared in others.

Figure 1.1: Multicore Architecture

Figure 1.1 depicts a general multicore architecture. The close proximity of multiple CPU cores on the same processor chip allows cache coherence circuits to operate at a higher clock rate. Cores on the die run in parallel. Within each core, multiple threads share the core much as they would share a uniprocessor; Intel calls such hardware threads hyperthreads. The operating system presents each of these cores as a separate processor and maps threads to different cores. Most major operating systems, such as Windows and Linux, support multicores today. The composition and balance of the cores in multi-core architectures show great variety. Some architectures use one homogeneous core design for all cores, while others use a mixture of different cores, each optimized for a different role.

1.2. The Need for Multicores

Traditional single core architectures can no longer significantly increase processor performance by frequency scaling. For general-purpose processors, much of the motivation for multicore processors comes from the greatly diminishing gains in single core performance. There are three main reasons for this:

(i) The memory wall

There is an increasing gap between processor speed and memory speed. To mask memory latency, cache sizes need to grow. However, this is not a scalable solution, because larger caches help only to the extent that memory bandwidth is not the performance bottleneck.

(ii) The ILP wall

It is becoming harder and harder to find enough parallelism in a single instruction stream to keep a higher performance single core processor busy.

(iii) The power wall

Power consumption grows much faster than linearly with operating frequency, because higher frequencies generally also require higher supply voltages. The power wall poses manufacturing, system design and deployment problems that are hard to justify in the face of the diminished performance gains caused by the memory wall and the ILP wall.
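
The power wall can be made concrete with the standard first-order model of dynamic CMOS power. This model is not taken from the sources surveyed here; the symbols (activity factor \alpha, switched capacitance C, supply voltage V, clock frequency f) and the assumption that V has to scale roughly in proportion to f are introduced purely for illustration.

```latex
P_{\text{dyn}} = \alpha \, C \, V^{2} f ,
\qquad
V \propto f \;\Longrightarrow\; P_{\text{dyn}} \propto f^{3}
```

Under these assumptions, doubling the clock frequency of one core costs roughly 2^3 = 8 times the dynamic power, whereas adding a second core at the original frequency and voltage costs roughly twice the power.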

In addition to these, deeply pipelined circuits in single core architectures result in heat problems, speed of light problems and difficulties in design and verification. Also, many new applications are multi threaded and the general trend in computer architecture has been a shift towards more parallelism.

Computer architects therefore needed a new approach to improve performance. An attractive solution was performance scaling through parallel processing using multicores. Multicore is a relatively new concept. Adding a second core to the same die would, in theory, double the performance while dissipating less heat, though in practice each core runs slower than the fastest single core processor. Multicores can ease the power wall problem by reducing power consumption through voltage scaling, ease the memory and power wall problems by reducing DRAM accesses with larger caches, and allow multiple threads on multiple cores to execute simultaneously.

Multicore architectures come with many advantages. These include low power consumption, less heat and smaller device sizes. Also, the proximity of multiple CPU cores on the same die allows the cache coherency circuitry to operate at a much higher clock rate than is possible if the signals have to travel off-chip. When signals travel on-chip instead of off-chip, signal quality remains high because the signal paths are shorter and the signals degrade less. The performance gain is most noticeable as faster response times when running CPU-intensive processes. For example, if an automatic virus scan runs while a movie is being watched, the application playing the movie is far less likely to be starved of processor power, as the antivirus program will be assigned to a different processor core than the one running the movie playback.

Multicore architectures have certain disadvantages too. Migrating to multicore devices requires complex changes to the system and software to obtain optimal performance. To optimize a multicore processor for a given set of resources and applications, the operating system as well as the applications need to be adjusted accordingly. Also, the performance gain seen from multicore processors depends greatly on the use of multiple threads in the application. Many technologies these days provide multicore support: Valve Corporation's Source engine, Emergent Game Technologies' Gamebryo engine and Apple Inc.'s latest OS, Mac OS X Snow Leopard, are some of the popular ones. Also, simply doubling the core frequency does not double performance.
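
The dependence of multicore gains on how much of an application is multithreaded is commonly quantified with Amdahl's law. The law is not cited in this section, so the following is only an illustrative sketch; the function name and the example fractions are assumptions made for this example.

```python
def amdahl_speedup(parallel_fraction: float, cores: int) -> float:
    """Ideal speedup predicted by Amdahl's law.

    parallel_fraction: share of the single-core run time that can be
    spread perfectly over the cores (assumed value between 0 and 1).
    cores: number of cores used.
    """
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must lie in [0, 1]")
    if cores < 1:
        raise ValueError("cores must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cores)


if __name__ == "__main__":
    # Hypothetical workloads: 50%, 90% and 99% of the work is parallel.
    for fraction in (0.5, 0.9, 0.99):
        speedups = [round(amdahl_speedup(fraction, n), 2) for n in (2, 4, 8, 16)]
        print(fraction, speedups)
```

In this model a program that is only half parallel gains at most a factor of two no matter how many cores are added, which is one way to read the observation that applications must be written with multiple threads to benefit from multicores.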

1.3. Multicore Architecture Classification

Multicores can broadly be classified into homogeneous and heterogeneous architectures. Homogeneous multicores implement multiple identical cores with the same ISA, whereas multicores whose cores have different ISAs are called heterogeneous. The current trend is towards homogeneous multicores. Figure 1.3 shows basic designs in multiple CPU systems. According to [5], multicore systems can also be classified based on their five most distinguishing attributes: the application class, power/performance, processing elements, memory system and accelerators/integrated peripherals. In this survey we explore some of these categories:

1.3.1. Classification based on application class

The multicore architectures can be designed to reflect the target application domain. Applications for multicore architectures broadly fall into two categories: data processing dominated and control dominated.

Data processing dominated applications include many familiar types of applications, such as graphics rasterization, image processing, audio processing, and wireless baseband processing. The common feature of these applications is that the computation consists of a sequence of operations applied to a data stream with minimal data reuse. These streaming applications, which require high throughput and performance, are good candidates for parallel operation, and they favor designs with as many processing elements as is practical for the desired power/performance ratio. Examples of control dominated applications, on the other hand, include file compression/decompression, network processing and transaction query processing. These applications contain a considerable number of conditional branches in their code and a high degree of data reuse. They favor a more modest number of general-purpose processing elements to handle the unstructured nature of control dominated code.

1.3.2. Classification based on memory system

Based on the memory system used, multicores can be classified into three main categories: distributed memory architectures, shared memory architectures and hybrid memory architectures. Figure 1.2 shows a multicore architecture classification based on memory designs. In a distributed memory architecture, each core typically has its own private memory, and communication between cores usually happens over a high speed network. The most common architecture, however, is the shared memory architecture, in which the memory is shared by all the cores. In a hybrid architecture, both a shared memory and per-core private memories exist.

1.4. Popular multicore architectures

There has been massive growth in the multicore market, and a wide variety of multicores has been developed for the commercial market in recent years. Table 1.1 [5] lists some general purpose architectures and their characteristics. The AMD PHENOM [6, 7], INTEL CORE i7 [8, 9], SUN NIAGARA [10, 11] and INTEL ATOM [12] processors are general purpose multicore architectures. All four are homogeneous architectures with large caches, intended for general purpose desktop and server applications where power is not an overriding concern. On the other hand, ARM-CORTEX [13] and XMOS-XS1 [14] are intended for the general purpose mobile and embedded market. These are also homogeneous architectures, well suited to control dominated applications. Since they are developed for the embedded/mobile market, where many devices run on batteries, power is an overriding concern. Table 1.2 shows some multicore architectures intended for high performance applications, which therefore employ a larger number of cores. For example, AMD RADEON R700 [15] contains 160 cores while NVIDIA G200 [16] contains 240 cores.

Table 1.1: General Purpose Multicore Architectures

Table 1.2: High Performance Multicore Architectures

1.5. On-Chip Interconnections

There have been several proposals and implementations of high-performance chip multiprocessor architectures [72, 73, 74, 75]. The proposed interconnect for Piranha [72] was an intra-chip switch. Cores in Hydra are connected to the L2 cache through a crossbar. In both cases, the L2 cache is fully shared. IBM Power4 [75] has two cores sharing a triply-banked L2 cache. There have been recent proposals for packet-based on-chip interconnection networks [76, 77]. Packet-based networks structure the top level wires on a chip and facilitate modular design. Modularity results in enhanced control over electrical parameters and hence can result in higher performance or reduced power consumption. These interconnections can be highly effective in environments where most communication is local, explicit core-to-core communication. However, the cost of distant communication is high. Due to their scalability, these architectures are attractive for designs with a large number of cores.

1.6. Future of Multicores

(i) Improved memory system

There is an enormous need for more memory in multicore systems with numerous cores on a single chip. A 32-bit processor, such as the Pentium 4, can address up to 4 GB of main memory. With cores now using 64-bit addresses, the addressable space grows to 2^64 bytes (16 EiB), far more than any current system can physically provide. Memory systems therefore need to improve significantly so that multithreaded multiprocessors actually have more main memory and larger caches available.

(ii) System bus and interconnection networks

The interconnection between cores is very important and should be a major focus for chip manufacturers seeking to reduce the time required for memory requests. An improved interconnection network and system bus yield faster communication and thus lower latency in both inter-core communication and memory transactions. Current approaches include Intel’s QuickPath Interconnect [17], a 20-bit wide link running at between 4.8 and 6.4 GT/s, and AMD’s HyperTransport 3.0 [18], a 32-bit wide bus running at up to 5.2 GT/s.

(iii) Parallel programming

One of the major challenges with multicores is parallel programming. Programmers need to know how to write parallel programs, capable of running on multiple cores in parallel, instead of sequential programs. Developers of software for multicores must also be concerned with application requirements. For example, a programmer should be able to specify priorities for tasks assigned to different cores. Programmers should also be provided with sophisticated debugging tools for programs that run on multicores, and software developers should provide methods to guarantee that the entire system stops, and not just the core on which an application is running. These issues need to be addressed along with teaching developers good parallel programming practices.

(iv) Starvation

Proper load distribution between cores is an important factor in multicores. If the program does not distribute load fairly between the cores, one or more cores may starve for data while others are overloaded. Also, with a shared cache, if a proper replacement policy is not in place, one core may starve for cache space and continually make costly accesses to main memory. The replacement policy should include rules for evicting cache entries that other cores have recently loaded. This becomes more difficult as the number of cores increases, since it effectively reduces the amount of cache space that can be evicted without increasing cache misses.

2. CONCEPTUAL ARCHITECTURE

2.1. Motivation

The multicore design process is becoming more complex day by day because of high performance requirements and numerous design constraints. Heterogeneous multi-core systems are built from several different processing units, e.g. processors, memories, buses and various other communication interfaces. In addition, many possible HW/SW partitioning schemes and parallelization techniques are involved in the design process. The design of such systems is therefore extremely complicated and time consuming. The system specification has to be clear and unambiguous, and must provide sufficient expressive power to automate the design process. Concepts and techniques applied at the specification level affect the quality, accuracy and speed of the results. The specification model is therefore the best place to validate these properties and to analyze the design.

2.2. Model of Computation

Models of Computation (MoC) [2] act as the building blocks for defining multi-core system behavior. Specific multicores can impose specific requirements on the MoC. For example, heterogeneous multi-core systems are composed of different processing elements and generally need an efficient concurrent model of computation. Facilities for expressing concurrent behaviour are therefore very important.

To make the most of the concurrency inherent in multi-core systems, it is convenient to describe an application specification in a parallel form. In this way, a multi-core platform can be programmed in a systematic and automated way. Moreover, other factors such as determinism, predictability, and secure and dependable operation over time are as important as the functionality itself.

MoC can handle both fine and coarse levels of granularity. At the fine level of granularity, basic entities correspond to individual instructions or statements. In contrast, at the coarse level of granularity, basic entities correspond to entire blocks of code. At the coarse level of granularity, based on the basic entities and the corresponding composition rules employed, MoC can be categorized into two types, i.e. process based and state based MoC.

2.2.1. Process-oriented MoC

Process oriented MoC [3] describe the system behavior as a set of concurrent processes that communicate with each other through either message passing channels or shared memory. In process oriented MoC, the emphasis is on describing concurrency explicitly. Such a specification is appropriate for implementation on a multi-core platform, which is inherently concurrent. A deterministic MoC produces the same output whenever it is executed with the same input set. Such behavior is highly desirable for validating system behavior, but a fully deterministic model can sometimes result in over-specification. More specifically, the ideal system should be composed of several processes that together produce the same outputs for a given set of inputs, while the order of their execution remains unconstrained. To cope with these constraints and requirements, different process based MoC have been proposed over the years. Some of the most prominent include Process Networks, Dataflow Models and Process Calculi.

2.2.1.1. Process Network

The main characteristic of specialized process based MoCs is that they provide deterministic behavior globally while allowing non-determinism in the execution of individual processes [2]. In Kahn Process Networks (KPN), communication happens over unidirectional, point-to-point message passing channels. These channels are implemented with buffers, which make communication asynchronous on the sender side. The channels are unbounded, so senders never block. Receiver processes, on the other hand, always block until the required input data is available on the channel. A process can wait on only a single channel at a time and cannot test a channel for data without blocking on it. The sequence of channel accesses is therefore predetermined, and processes cannot change their behavior at run time depending on the order in which data becomes available on particular channels. This ensures deterministic system behavior that does not depend on the order in which processes are scheduled. KPN allow deadlocks, but the network terminates globally when a deadlock occurs in the system.

Fig. 2.1. Kahn Process Network

Scheduling strategy is another important aspect of a Kahn Process Network, as it directly affects completeness and memory requirements. The two basic policies are given below:

(i) Demand driven scheduling: A process is run only when its output data is needed. In the scenario of Fig. 2.1, process P1 is executed only when P3 requires data, and process P2 is executed whenever process P3 or process P4 requires data. Such behavior can turn the local deadlock of a single process into a global artificial deadlock. Suppose P3 is blocked in a local deadlock or does not need any data; demand driven scheduling would then not execute process P2, which in turn prevents the independent process P4 from running.

(ii) Data driven scheduling: This strategy was developed to mitigate the limitations of the demand driven strategy. Data driven scheduling runs processes whenever they are ready. This avoids the local deadlock problem, but tokens can accumulate on arcs, which increases memory consumption.

In general, KPN is determinate: regardless of the scheduling policy employed, a given input set always produces the same output. This characteristic makes KPN suitable for designing multicore systems, and it gives a lot of scheduling freedom that can be exploited when mapping process networks onto various multi-core platforms [2]. The main drawback of KPN is that it requires dynamic scheduling with runtime context switching as well as dynamic memory allocation.
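
The channel semantics of a Kahn Process Network (unbounded buffers, non-blocking writes, blocking reads) can be imitated with threads and queues. The sketch below is only an illustration and not an implementation from [2]; the three-process network, the function names and the END marker are assumptions made for this example.

```python
import queue
import threading

END = None  # hypothetical end-of-stream marker, not part of the KPN model


def source(values, out_channel):
    """Process that writes its values to one channel and then stops."""
    for value in values:
        out_channel.put(value)  # unbounded queue: the sender never blocks
    out_channel.put(END)


def adder(in_a, in_b, out_channel):
    """Process that reads one token from each input and writes their sum."""
    while True:
        x = in_a.get()  # blocking read: waits until a token is available
        y = in_b.get()  # channels are always read in the same fixed order
        if x is END or y is END:
            out_channel.put(END)
            return
        out_channel.put(x + y)


def run_network(xs, ys):
    """Run two sources feeding one adder and collect the adder's output."""
    c1, c2, c3 = queue.Queue(), queue.Queue(), queue.Queue()
    processes = [
        threading.Thread(target=source, args=(xs, c1)),
        threading.Thread(target=source, args=(ys, c2)),
        threading.Thread(target=adder, args=(c1, c2, c3)),
    ]
    for process in processes:
        process.start()
    result = []
    while True:
        token = c3.get()
        if token is END:
            break
        result.append(token)
    for process in processes:
        process.join()
    return result
```

Because every read blocks on one fixed channel, the output sequence is the same however the operating system interleaves the three threads, which mirrors the determinacy property described above. The price, as noted, is dynamic scheduling and queues whose size is not known in advance.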

2.2.1.2. Synchronous Data Flow

KPN requires dynamic scheduling with runtime context switching and dynamic memory allocation, which makes practical implementation hard. The Synchronous Data Flow (SDF) specification addresses these shortcomings. SDF is a variant of traditional data flow models. In data flow models, processes are broken down into atomic blocks of execution called actors. An actor executes only once it has received all its required input tokens, which avoids context switching in the middle of a running process. On every execution, each actor consumes a required number of tokens and generates a resulting number of tokens.

Fig. 2.2. Synchronous Data Flow

In SDF, the number of tokens consumed and generated by an actor at each firing is fixed. Hence, the amount of data flow or control flow in an SDF graph is predetermined and cannot change in response to any runtime scenario. As a result, static SDF graphs are bounded, and the required buffer sizes for the communication channels are known before runtime.

Fig. 2.2 shows an example of an SDF graph with four actors a, b, c, d. On every execution, a produces two tokens, one of which is consumed by b. b produces two tokens, one of which is consumed by c. c produces one token and sends it to d. Finally, d consumes two tokens on its input link and produces two tokens on its output link. The graph is initialized by putting two tokens on the arc between c and d. To schedule such a graph, we first write down a set of linear equations (balance equations) that determine the execution rates of the actors relative to one another:

2a = b,   2b = c,   b = d,   2d = c

Here each letter denotes the number of executions of the corresponding actor. The equations imply that b executes twice for each execution of a, c executes twice for each execution of b, d executes once for each execution of b, and c executes twice for each execution of d. The system of linear equations reduces to

4a = 2b = c = 2d

Picking the solution with the smallest rates, we have to execute c four times and b and d two times each for every execution of a. If the equations are inconsistent, or have no solution other than setting all rates to zero, the SDF graph cannot be statically scheduled, or would otherwise lead to an accumulation of tokens on the arcs. After calculating the execution rates, we can generate a schedule by simulating one iteration until the initial state is reached again. If a deadlock occurs during the iteration, initial tokens can be placed on an arc to resolve it. For Fig. 2.2, an example execution order is adbccdbcc. For this schedule, at most two tokens accumulate on each arc at any time, and the total memory requirement is 8 token buffers.
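
The smallest execution rates of an SDF graph can be computed mechanically from the token rates on its arcs. The following sketch is illustrative only. Fig. 2.2 is not reproduced here, so the edge list below is an assumption chosen to yield the rates stated in the text (c four times, b and d twice for each execution of a); in particular the d-to-b arc with one token produced and one consumed is hypothetical.

```python
from fractions import Fraction
from functools import reduce
from math import gcd


def repetition_vector(edges):
    """Smallest positive integer firing counts that balance every arc.

    edges: list of (producer, consumer, tokens_produced, tokens_consumed).
    Assumes the graph is connected. Raises ValueError if the balance
    equations are inconsistent.
    """
    rates = {edges[0][0]: Fraction(1)}
    progress = True
    while progress:  # propagate rates along arcs until nothing changes
        progress = False
        for src, dst, produced, consumed in edges:
            if src in rates and dst not in rates:
                rates[dst] = rates[src] * produced / consumed
                progress = True
            elif dst in rates and src not in rates:
                rates[src] = rates[dst] * consumed / produced
                progress = True
    for src, dst, produced, consumed in edges:
        # Balance equation of one arc: tokens produced == tokens consumed.
        if rates[src] * produced != rates[dst] * consumed:
            raise ValueError("inconsistent balance equations")
    scale = reduce(lambda x, y: x * y // gcd(x, y),
                   (r.denominator for r in rates.values()), 1)
    counts = {actor: int(r * scale) for actor, r in rates.items()}
    common = reduce(gcd, counts.values())
    return {actor: n // common for actor, n in counts.items()}


# Assumed arcs (producer, consumer, produced, consumed); see the note above.
EXAMPLE_EDGES = [
    ("a", "b", 2, 1),
    ("b", "c", 2, 1),
    ("d", "b", 1, 1),  # hypothetical arc, not described in the text
    ("d", "c", 2, 1),
    ("c", "d", 1, 2),
]
```

Calling repetition_vector(EXAMPLE_EDGES) yields one firing of a, two of b, four of c and two of d per iteration. A graph whose arcs demand contradictory rates raises ValueError, which corresponds to the case where no static schedule exists.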

Though SDF offers significant advantages over KPN, such as static scheduling and the absence of expensive runtime context switching, it has its own limitations; for example, the SDF model cannot express conditional execution of a block [2].

2.2.1.3. Process Calculi

Process calculus gives a high level formal description of interactions, communications and synchronization mechanisms among concurrent processes [3]. Formal description is presented as a set of processes, composition rules and axioms. Specific composition can be of two types, parallel composition and sequential composition. Furthermore, notion of recursion and replication are also supported.

Process calculi models are suitable for analysis, equivalence checking and formal verification because of the abovementioned characteristics of formalization and restricted execution.

2.2.2. State based Models

State based models [3] are described in terms of state machines and consist of a set of states and transitions between those states. State based models put more emphasis on showing control flow explicitly. Typically, states explicitly represent the memory state of the program. In contrast to process oriented MoC, state based models are mainly used in control dominated applications.

2.2.2.1. Finite state machine (FSM)

The FSM is a basic model in computer science for modeling various types of applications. It is defined as a quintuple <S, I, O, f, h>, where S is the set of states, I and O are the sets of inputs and outputs respectively, f: S × I → S is the next state function and h is the output function [3]. Traditional FSMs are sequential, i.e. they can be in only one state at a time. Every distinct situation therefore needs its own state, which can lead to a very large number of states when modeling a sizeable system. To solve this problem, extensions of the traditional FSM, such as the FSM with data (FSMD) and the hierarchical and concurrent FSM (HCFSM), have evolved.
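
The quintuple definition maps directly onto a small interpreter. The sketch below is illustrative only: it assumes a Mealy-style output function h: S × I → O (the text does not fix the domain of h), and the example machine, which outputs 1 whenever two consecutive 1 inputs have been seen, is a hypothetical example.

```python
def run_fsm(start_state, inputs, next_state, output):
    """Run an FSM <S, I, O, f, h> over a sequence of input symbols.

    next_state: dict mapping (state, input) to the next state (function f).
    output: dict mapping (state, input) to an output symbol (function h,
            assumed here to depend on both state and input).
    """
    state = start_state
    produced = []
    for symbol in inputs:
        produced.append(output[(state, symbol)])
        state = next_state[(state, symbol)]
    return produced


# Hypothetical machine with S = {"S0", "S1"}, I = {0, 1}, O = {0, 1}.
# "S1" records that the previous input was a 1.
F = {("S0", 0): "S0", ("S0", 1): "S1",
     ("S1", 0): "S0", ("S1", 1): "S1"}
H = {("S0", 0): 0, ("S0", 1): 0,
     ("S1", 0): 0, ("S1", 1): 1}
```

For the input sequence 1, 1, 0, 1, 1, 1 this machine produces 0, 1, 0, 0, 1, 1. Remembering a longer input history would require additional states, which illustrates how the number of states grows with the behavior being modeled.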

2.2.2.2. Hierarchical and concurrent Finite State Machine (HCFSM)

Hierarchy and concurrency are further techniques for handling the complexity of a system. In hierarchical models, the concept of super states [3] has been introduced. Each super state can be a standalone FSM. Entering a super state is equivalent to entering the start state of the FSM within it. Whenever a super state finishes its execution and exits, the parent FSM transitions to another super state.

Concurrency on the other hand, breaks complex state machines into multiple simpler FSMs running in parallel. These FSMs can communicate with each other through shared channels, variables and firing events.

2.2.2.3. Program State Machine

Program state machines (PSM) [2] can be seen as a combination of KPN and HCFSM. Therefore, PSM make use of the asynchronous execution of the KPN model as well as the notions of hierarchy and control from the HCFSM model.

Fig. 2.3. Program State Machine

As shown in Fig. 2.3, processes run asynchronously to each other and can execute both concurrently and sequentially. Concurrent processes communicate through message-passing channels that incorporate FIFO buffers to provide asynchronous communication. Such message passing channels are a good way to separate communication details from computation.

3. RESOURCE MANAGEMENT OF MULTICORE SYSTEMS

Improved utilization of hardware resources is an important aspect of multicore architectures. The shared resources include off-chip bandwidth, shared memory, power, etc. Better resource management strategies improve performance and allow for smaller die areas and simpler battery technologies. Shared resources mostly have to do with the memory hierarchy. In the following sections we discuss the memory contention problem and memory management strategies.

3.1. Shared Memory Contention in Multicore Systems

Figure 3.1 provides an illustration of a system with two memory domains and two cores per domain. In this system, threads on cores within the same memory domain can compete for shared resources. This results in significant performance degradation compared to the performance a thread could achieve in a contention-free environment. Previous studies have documented that the execution time of a thread can vary significantly depending on whether threads run on the other cores of the same chip [19, 20]. This is especially true when cores share last-level caches (LLCs). For example, in Figure 3.1, core 0 and core 1 compete for one shared LLC, while core 2 and core 3 compete for the other.

Figure 3.1: A Multicore System with Two Memory Domains

When a thread issues a cache request for a line that is not already in the cache (i.e. a cache miss), a new cache line must be allocated to bring in the requested line. This becomes a problem if the cache is full when the request arrives, because some cache line must be evicted to free up space for the new line. It is quite possible that the evicted line belongs to a different thread from the one that issued the request, thus degrading that thread's performance. Modern CPUs do not assure any fairness in this regard.
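
The eviction effect described above can be reproduced with a toy model of a shared, fully associative LRU cache. This is a simplified sketch and not a model of any real processor: the cache size, the access pattern and all names are assumptions chosen for illustration. Thread A repeatedly reuses four lines, while thread B streams through lines it never touches again.

```python
from collections import OrderedDict


class LRUCache:
    """Toy fully associative cache with least-recently-used replacement."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.lines = OrderedDict()  # oldest entry first

    def access(self, line):
        """Access one line; return True on a hit and False on a miss."""
        if line in self.lines:
            self.lines.move_to_end(line)  # mark as most recently used
            return True
        if len(self.lines) >= self.capacity:
            self.lines.popitem(last=False)  # evict the least recently used
        self.lines[line] = None
        return False


def misses_of_thread_a(capacity, accesses, with_streaming_thread):
    """Count thread A's misses, optionally interleaved with streaming B."""
    cache = LRUCache(capacity)
    misses = 0
    for i in range(accesses):
        if not cache.access(("A", i % 4)):  # A reuses four lines
            misses += 1
        if with_streaming_thread:
            cache.access(("B", i))  # B touches a new line every time
    return misses
```

With a capacity of six lines, thread A alone misses only on its first four accesses. Once the streaming thread shares the cache, every one of A's accesses misses in this model, because B's lines push A's working set out before it is reused.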

Figure 3.2 [23] shows the cache sensitivity under LRU insertion policy of two SPEC CPU2006 workloads. When both these workloads execute concurrently and share a 2MB cache, soplex, a streaming application, interferes with h264ref. Cache performance can be improved by reducing the interference.

Figure 3.2: The Shared Cache Problem

3.2. Shared Memory Management Strategies

Many researchers have proposed strategies to conquer the problem of resource contention. These strategies fall into two main categories: cache partitioning strategies and contention aware scheduling.

3.2.1. Cache Partitioning Strategies

Cache partitioning is one of the popular ways to utilize shared resources effectively between cores with minimum contention. In cache partitioning, shared caches such as L2 and L3 caches are partitioned among threads that are running simultaneously on multiple cores. Many multicore processors today still use cache designs from uniprocessors. However, many cache partitioning methods have been proposed in the recent past. These focus on a variety of optimization objectives including performance, fairness and quality of service. The cache partitioning strategies proposed to date can be broadly classified into static partitioning schemes and dynamic partitioning schemes.

A. Static Cache Partitioning Strategies

In static cache partitioning, the cache is partitioned among multiple threads from different cores statically. Once defined, the sizes of the cache partitions do not change. Below we discuss some static cache partitioning schemes that are more aligned with the topic of this survey.


(i) Optimal Cache Partitioning

In optimal cache partitioning [21], the authors propose a method for optimally allocating cache memory among competing processes. They focus on two main problems. The first is the allocation of cache memory to interlaced data and instruction streams. The authors develop a model for a simpler, modified LRU replacement strategy and use it to obtain a model for pure LRU replacement. The modified LRU strategy provides better results in certain circumstances. In this work, the overall miss rate of the cache is used as the measure of optimality, and an optimal partition is defined as a partition of the cache among competing processes or threads that achieves the minimum miss rate. The second problem is how to allocate memory among processes in a multiprogramming environment. Here, a cache reload transient is associated with a new process taking over the processor. During the early part of this transient the miss rate goes up, and it eventually goes down once the working set has been brought into the cache.

Let us now look at how the first problem is solved: the allocation of cache memory between data and instruction streams. Here the authors consider interlaced instruction and data streams that have different cache behaviors. They show that in this idealized setting, the optimal allocation occurs at a point where the miss rate derivatives of the competing processes are equal. Fully associative search for lookup or replacement is infeasible in faster memories such as caches because of the high complexity of maintaining LRU information. Therefore, a simpler approach called approximate LRU replacement is taken, in which the search is performed on a small set of items and the least recently used item in this set is replaced if necessary. For slower memories in the memory hierarchy, however, fully associative search can be used and true LRU replacement is therefore employed.

The main focus of this work is the miss rate as a function of the cache allocated to competing processes. For fully associative caches, the miss rate for a given reference stream is a single-parameter function and depends only on the number of lines allocated to a process, because the entire cache is searched for a match during a lookup. For set-associative caches, the miss rate depends not only on how many lines are allocated to a process but also on where they are located in the cache, since only a set of a few lines is actually searched during a lookup. In this work, the authors use a simplified model of set-associative caches in which the miss rate is a single-parameter function of the number of lines allocated. The physical distribution of these lines across the sets of the cache is considered irrelevant and is assumed to have no effect on the miss rate. This assumption allows set-associative and fully associative caches to be treated in the same manner, so the proposed model can be applied to both types of caches.

Let us examine the processes that generate the cache references. Assume that an address-reference stream is composed of two interlaced streams of addresses. One stream consists of instruction fetches, and the second consists of data fetches. The composite stream is an interleaving of the two, so that its address references alternate between instructions and data. That is, the stream has the form I, D, I, D, ..., where I and D are instruction and data references, respectively. Each component stream has a known cache behavior, given by a miss rate for that stream as a function of the cache memory allocated to it. Let $M_I(x)$ be the miss rate for the I stream as a function of the cache size x allocated to it and, similarly, let $M_D(x)$ be the miss rate for the D stream. We assume that both the instruction and data processes are stationary in time, so that the miss rates are not time-varying functions. We now need to determine the optimal fixed allocation of cache for the I and D streams. To do so, we find an expression for the misses in a period of time that contains exactly T references, and find an allocation at which the derivative of this expression goes to zero. The total number of misses in a time period with T references is the composite miss rate times the length of the period.
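Since we assume that I and D references occur with equal frequency in the interval T, the total number of misses is given by:

$$\text{Total misses} = \frac{T}{2} M_I(x) + \frac{T}{2} M_D(C - x) \qquad (1)$$

where x is the amount of cache allocated to the I stream, C is the total cache size, and the remaining C - x is allocated to the D stream.

To minimize the overall miss rate, the authors minimize the total misses given in (1) by setting the derivative of the right-hand side of (1) with respect to x to 0, which occurs at a value of x that satisfies:

$$M_I'(x) = M_D'(C - x)$$

where the prime denotes the derivative of a miss rate function with respect to its allocated cache size.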


The authors also show that the most probable state of a conventional LRU replacement policy is not the optimal allocation of memory between the I and D reference streams, although LRU is capable of producing very good allocations.
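The optimality condition for the I and D streams can be checked numerically. The sketch below is only an illustration: the miss rate curves `miss_i` and `miss_d`, the total cache size of 31 lines, and the function names are assumptions made here, not values taken from [21]. It searches all integer allocations for the one that minimizes total misses and lets one confirm that the slopes of the two curves match at that point.

```python
def total_misses(x, cache_size, refs, miss_i, miss_d):
    """Total misses over `refs` references when x lines go to the I stream.

    Half of the references are instruction fetches and half are data
    fetches; the D stream receives the remaining cache_size - x lines.
    """
    return (refs / 2) * miss_i(x) + (refs / 2) * miss_d(cache_size - x)

def best_split(cache_size, refs, miss_i, miss_d):
    """Exhaustively search for the I-stream allocation with the fewest misses."""
    return min(range(cache_size + 1),
               key=lambda x: total_misses(x, cache_size, refs, miss_i, miss_d))

# Hypothetical, smoothly decreasing miss rate curves (assumed for illustration).
def miss_i(x):
    return 1.0 / (x + 1)

def miss_d(x):
    return 4.0 / (x + 1)

def slope_i(x):
    return -1.0 / (x + 1) ** 2     # derivative of miss_i

def slope_d(x):
    return -4.0 / (x + 1) ** 2     # derivative of miss_d

CACHE_SIZE = 31
x_opt = best_split(CACHE_SIZE, 1000, miss_i, miss_d)
```

For these assumed curves the search returns an allocation of 10 lines to the I stream and 21 lines to the D stream, and the two slopes evaluated at those allocations are equal, as the derivative condition requires.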

(ii) Multi-Queue

The authors of [22] propose a cache management scheme that organizes each cache set into multiple FIFO queues. In a FIFO, each entry corresponds to a single cache line, and together the entries hold the lines of a single cache set. Figure 3.3 (a) depicts the organization for a single core. As shown in the figure, there are two levels of FIFO queues and an LRU-managed cache region. To get into the LRU-managed region, a line must pass through the first-level FIFO and then the second-level FIFO. Each queue entry has an associated u-bit. The u-bit is reset when a line enters any one of the queues and is set on every reuse. For a queue of size Q, the original line leaves the queue after Q more insertions. If its u-bit is still zero, the cache immediately evicts the line. If the u-bit is one, the line is inserted into the next-level FIFO. This organization allows rarely used lines to be evicted earlier than they would be from a shared LRU region. The modified organization for a multicore environment is shown in Figure 3.3 (b). Here, each core is assigned a private first-level FIFO. The second-level FIFO and the LRU-managed cache region are shared among cores.

Figure 3.3 Logical organization of the basic multi-queue cache management scheme for (a) a single core and (b) multiple cores.
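The u-bit mechanism of a single FIFO level can be sketched as follows. This is a simplified illustration rather than the scheme of [22]: the class `FifoLevel`, the function `access`, and the use of a plain unbounded list as a stand-in for the second-level FIFO and the LRU region are all assumptions made for brevity.

```python
from collections import deque

class FifoLevel:
    """One FIFO queue of a cache set; each entry is [address, u_bit]."""

    def __init__(self, size):
        self.size = size
        self.queue = deque()           # oldest entry on the left

    def touch(self, addr):
        """Set the u-bit on reuse; return True if the line is in this queue."""
        for entry in self.queue:
            if entry[0] == addr:
                entry[1] = 1
                return True
        return False

    def insert(self, addr):
        """Insert with the u-bit cleared; return the (addr, u_bit) that leaves, if any."""
        self.queue.append([addr, 0])
        if len(self.queue) > self.size:
            old_addr, u_bit = self.queue.popleft()
            return old_addr, u_bit
        return None

def access(level1, next_level, addr):
    """Access one line; next_level is an assumed stand-in for the later stages."""
    if addr in next_level or level1.touch(addr):
        return "hit"
    leaving = level1.insert(addr)
    if leaving is not None and leaving[1] == 1:
        next_level.append(leaving[0])  # reused line is promoted
    # A leaving line whose u-bit is 0 is simply dropped, i.e. evicted early.
    return "miss"
```

With a first-level queue of size Q, a line leaves the queue after Q further insertions, and only lines that were reused in the meantime survive into the next stage.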


The MQ approach targets three problematic cache behaviours. The first is cache lines that are inserted into the cache but never referenced again. A considerable number of such lines reside in a cache at any given time. Under the conventional LRU replacement policy in a w-way cache, such a line is evicted only after w further insertions. By occupying cache space, these lines reduce the number of hits. In the MQ approach, however, such lines are evicted after only Q < w insertions for a Q-entry first-level FIFO, which reduces the residency time of no-reuse lines. The second problem is cache bursts, which result from temporal locality: a line is reused shortly after insertion, but after this initial burst it is never reused until eviction. Under LRU, such a line is evicted only after w more insertions. The third problem is the isolation of different cores. A core with a high access rate can quickly evict the cache lines of another core sharing the same cache.

Figure 3.4 (left) shows a two-core example for a cache that is managed using the LRU policy. Here, core 0 inserts one line while core 1, which has a rapid access rate, inserts a large number of cache lines in succession. This results in the eviction of core 0's inserted line and of other useful lines residing in the cache. Subsequent accesses to these evicted lines result in cache misses. To solve this problem, MQ introduces a dedicated first-level queue for each core, which isolates the traffic of that core. As shown in Figure 3.4 (right), when a line is inserted into core 0's first-level queue, it is protected from eviction by other cores. Also, since core 1's lines show no reuse, they are evicted as soon as they are dequeued from core 1's first-level queue. Other cache lines with proven reuse (those with the striped patterns in the figure) are retained in the second-level queue or the LRU region. This allows additional cache hits to occur.


Figure 3.4 (left) Two cores where Core 0 reuses its lines and Core 1 does not, but Core 1 accesses the cache at a much faster rate. (right) The same scenario when Core 0 and 1 have their traffic isolated using multiple queues.

B. Dynamic Cache Partitioning Strategies

Dynamic cache partitioning is more successful than static partitioning because it allows the cache to be partitioned among cores at run time based on need. A recent study [27] showed that dynamically changing the insertion policy can provide high-performance cache management for private caches at negligible hardware and design overhead. Here we describe some of the popular schemes.

(i) TADIP

Thread-Aware Dynamic Insertion Policy (TADIP) [23] is an adaptive insertion policy for caches shared among competing threads running on multiple cores. TADIP aims at four goals: high performance, robustness, scalability and low design overhead. The number of cores and concurrent threads is expected to increase in future processor designs. Therefore, TADIP aims to provide an insertion mechanism that scales with the number of cores and threads while having minimal effect on performance in cases where the LRU policy works better. TADIP is also a mechanism with negligible hardware overhead for managing shared caches.


The Dynamic Insertion Policy (DIP) proposed in [27] primarily uses two policies: the Bimodal Insertion Policy (BIP) and the LRU policy. BIP inserts the majority of incoming cache lines in the least recently used position while selectively inserting a few in the most recently used position. DIP uses the set dueling principle to dynamically select the best policy at a particular time. Set dueling monitors keep track of the misses generated by each policy; for this purpose, specific cache sets are dedicated to running each policy. The winning policy, which is the one with the fewest misses, is followed by the rest of the cache sets. However, DIP does not take into account the possibility of multiple threads competing for a shared cache. In TADIP, the authors propose an extension to DIP that addresses this shortcoming. The authors show that TADIP outperforms the traditional LRU policy and leaves significant room for improvement.

TADIP needs to make a binary decision between LRU and BIP for each competing application running on the cores. For N concurrently executing applications sharing a cache, the search space is the set of N-bit binary strings, so there are 2^N possible strings and the best performing one needs to be selected. When N is small enough, the set dueling principle can be used to select the best performing string. The problem, however, is that the number of set dueling monitors needed increases exponentially with the number of competing applications running concurrently. Therefore, the authors propose two scalable approaches that avoid this exponential increase. The exponential search space is reduced to a linear one by learning the best insertion decision for each application individually, so that applications that do not benefit from the cache are filtered out by inserting their lines with BIP.

TADIP comes in two main flavours: TADIP-I and TADIP-F, which improves upon TADIP-I. TADIP-I stands for TADIP-Isolated. It tries to learn the insertion decision of each competing application in isolation. For N applications competing for the same shared cache, N+1 Set Dueling Monitors (SDMs) are used for this purpose. Each SDM learns the insertion policy of one application independently. The baseline SDM (i.e. the first SDM) uses a fixed policy, namely the traditional LRU policy, for all applications. The remaining N SDMs (called bimodal SDMs) each use BIP for one application and the LRU policy for the others. Figure 3.5 (b) depicts the TADIP-I scheme for a cache shared by four applications. Given a binary string <P0,P1,P2,P3>, the insertion policy for Application 0 is P0, for Application 1 is P1, and so on. The bimodal insertion policy (BIP) is used when Px is 1; otherwise the LRU policy is used. Px is the MSB of a policy selection (PSEL) counter. Both TADIP schemes require a per-core PSEL counter. As depicted in the figure, the baseline SDM uses the binary string <0,0,0,0>, which indicates that all applications use the LRU policy. The four bimodal SDMs use the binary strings <1,0,0,0>, <0,1,0,0>, <0,0,1,0>, and <0,0,0,1>, respectively.

Figure 3.5: Adaptive Insertion Managed Shared Caches. Three schemes for managing a cache shared by four applications. (a) DIP (b) TADIP-Isolation (c) TADIP-Feedback. Set Dueling Monitors (SDMs) estimate misses for a given policy and follower sets use the best performing policy.

The PSEL counter associated with each application is used to select the best insertion policy for that application. The PSEL counters of all applications are incremented whenever the baseline SDM encounters a miss. A miss in a bimodal SDM, on the other hand, decrements only the PSEL counter of the associated application. However, TADIP-I has one major problem: the insertion decision of an application might depend on another application's insertion decision. To avoid this problem, the authors introduce TADIP-F, i.e. TADIP with feedback. To learn the winning insertion policy, TADIP-F uses 2N SDMs, two per application. Figure 3.5 (c) depicts the TADIP-F scheme for four applications sharing one cache. For four applications, eight SDMs are used, two for each application.
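The PSEL update rule described for TADIP-I can be sketched in a few lines. This is an illustrative model only: the class name `TadipIsolated`, the counter width, the saturating behaviour, and the initial counter value just below the midpoint are assumptions made here and are not taken from [23].

```python
class TadipIsolated:
    """Per-application PSEL counters driven by misses in the SDMs."""

    def __init__(self, num_apps, bits=10):
        self.max_value = (1 << bits) - 1
        self.msb_mask = 1 << (bits - 1)
        # Assumed starting point: just below the midpoint, so every MSB is 0 (LRU).
        self.psel = [self.msb_mask - 1] * num_apps

    def baseline_sdm_miss(self):
        """A miss in the all-LRU baseline SDM increments every PSEL counter."""
        self.psel = [min(value + 1, self.max_value) for value in self.psel]

    def bimodal_sdm_miss(self, app):
        """A miss in the bimodal SDM of `app` decrements only that app's counter."""
        self.psel[app] = max(self.psel[app] - 1, 0)

    def policy_string(self):
        """Return <P0, ..., PN-1>: 1 selects BIP, 0 selects LRU (MSB of PSEL)."""
        return [1 if value & self.msb_mask else 0 for value in self.psel]
```

The follower sets would then insert lines of application x with BIP when the x-th element of `policy_string()` is 1 and with LRU otherwise.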

(ii) UCP

Utility Based Cache Partitioning (UCP) [49] is a low-overhead, high-performance partitioning scheme for shared caches. UCP is based on the idea that the utility of a cache varies widely from application to application. If two applications with low utility are executed together, their performance is not sensitive to the amount of cache available to each application. On the other hand, if two applications with saturating utility execute together, the cache is capable of supporting the needs of both. However, when a low-utility application and a high-utility application execute together, it is possible that the working set of the high-utility application is not retained in the cache. Therefore, it is important to partition the cache among applications by taking into account the utility of each application. The authors achieve this by quantitatively defining cache utility for a single application, with the cache allocated on a way basis. For an application that incurs miss_a and miss_b misses with a and b ways respectively, the utility of increasing the number of ways from a to b is defined as follows:

$$U_a^b = \text{miss}_a - \text{miss}_b$$

Figure 3.6: Framework for Utility-Based Cache Partitioning.

Figure 3.6 shows the proposed UCP framework. The figure depicts two applications that compete for the shared L2 cache in a dual-core system, with each application running on a separate core. A utility monitor circuit (UMON) is dedicated to each core to monitor the application running on that core. To allow the UMON circuit to obtain utility information for all the ways in the cache, it is implemented separately from the shared cache. Based on the utility information collected, the partitioning algorithm decides the number of ways to allocate to each core. To monitor utility effectively, there needs to be a method for tracking the number of misses for every possible number of ways. For example, for an 8-way cache, the UMON should track misses for 8 possible allocations, from only one way allocated to an application to all 8 ways allocated to it. The brute force approach is to maintain 8 tag directories, each with a different number of ways, but the hardware overhead of multiple directories makes this approach impractical. However, the baseline LRU policy obeys the stack property, i.e. an access that hits in an n-way cache also hits in a cache with more than n ways. Therefore it is possible to compute hit and miss information for all possible numbers of ways with a single tag directory.

For an n-core system, n UMON circuits are needed, one for each core. To reduce the hardware overhead of UMON, the authors use Dynamic Set Sampling (DSS), which approximates the behaviour of the cache using only a few sets. The hit counter information in UMON is approximated using DSS.
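The stack property can be made concrete with a short sketch. It assumes a single cache set managed by true LRU and a hypothetical trace of line addresses; the function names `umon_counters`, `misses_with` and `utility` are introduced here for illustration and do not come from [49]. One hit counter per recency position is enough to recover the miss count for every possible number of ways.

```python
def umon_counters(trace, ways):
    """Run one LRU stack over the trace, counting hits per recency position.

    Returns (hits_at, misses), where hits_at[p] is the number of hits found
    at stack position p (0 = MRU) and misses counts accesses that missed
    even with all `ways` ways available.
    """
    stack = []                      # most recently used line first
    hits_at = [0] * ways
    misses = 0
    for addr in trace:
        if addr in stack:
            pos = stack.index(addr)
            hits_at[pos] += 1
            stack.pop(pos)
        else:
            misses += 1
            if len(stack) == ways:
                stack.pop()         # drop the LRU line
        stack.insert(0, addr)       # accessed line becomes MRU
    return hits_at, misses

def misses_with(hits_at, misses, k):
    """Misses the same trace would incur with only k ways (stack property)."""
    return misses + sum(hits_at[k:])

def utility(hits_at, misses, a, b):
    """Utility of growing the allocation from a ways to b ways."""
    return misses_with(hits_at, misses, a) - misses_with(hits_at, misses, b)
```

A hit at stack position p would have been a miss in any cache with p or fewer ways, which is why summing the counters from position k onward gives the extra misses of a k-way allocation.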

(iii) Dynamic partitioning of shared cache memory

[48] presents a dynamic cache partitioning algorithm that allocates cache among simultaneously executing processes such that overall cache misses are reduced. Using a set of online counters, the scheme dynamically estimates each process's gain or loss, in terms of the number of cache misses, under different cache allocations. Based on this estimate, the cache allocation is changed dynamically so that processes with higher losses are allocated more cache space. For N concurrent processes competing for a shared cache of C blocks, with partitioning on a block basis, the problem is to partition the cache into N disjoint subsets of cache blocks in such a way that the number of misses is minimized. The partition is fixed over a given amount of time T. Given that c_i is the number of cache blocks allocated to the i-th process over T, the cache partition is specified by the number of cache blocks allocated to each process: {c_1, c_2, c_3, ..., c_N}. Also, given that m_i(c) is the number of cache misses of the i-th process over T as a function of partition size, the optimal partition for time period T is the set {c_1, c_2, c_3, ..., c_N} that minimizes the total misses over T:

$$\sum_{i=1}^{N} m_i(c_i)$$

subject to

$$\sum_{i=1}^{N} c_i = C$$

where C is the total number of cache blocks. The authors define the marginal gain of a process, g_i(c), as:

$$g_i(c) = m_i(c) - m_i(c+1)$$

which is simply the derivative of the miss curve m_i(c) at a given cache space c. This marginal gain represents the number of cache misses that can be avoided by using one extra cache block. It therefore indicates the benefit to a process of increasing its cache allocation from c to c+1 blocks. These marginal gains need to be calculated online for different cache sizes, and the authors use a set of counters for this purpose. For a fully associative cache with C blocks, g(c) can be computed online over a time period T using C counters: the marginal gain g(c) is obtained directly by counting the number of hits in the (c+1)-th most recently used block.
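One way the marginal gains could drive an allocation is a greedy loop that hands out blocks one at a time to the process that currently benefits most. The sketch below is an assumption made for illustration rather than the exact algorithm of [48]; the function name `greedy_partition` and the gain tables in the usage note are hypothetical, and a greedy choice of this kind is only guaranteed to be optimal when every miss curve is convex.

```python
def greedy_partition(gains, total_blocks):
    """Distribute total_blocks among processes using marginal gains.

    gains[i][c] is the marginal gain of process i when its allocation grows
    from c to c + 1 blocks. A process whose table is exhausted is treated
    as gaining nothing from further blocks.
    """
    alloc = [0] * len(gains)

    def next_gain(i):
        # Gain of giving process i one more block than it currently has.
        return gains[i][alloc[i]] if alloc[i] < len(gains[i]) else 0

    for _ in range(total_blocks):
        best = max(range(len(gains)), key=next_gain)
        alloc[best] += 1
    return alloc
```

For example, with the hypothetical gain tables `[[10, 5, 1, 0], [8, 7, 6, 0]]` and four blocks, the first process receives one block and the second receives three, because after its first block the first process gains less from each additional block than the second does.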

3.2.2. Contention Aware Scheduling Strategies

(i) Cache-Aware Scheduling for Multicores (FPCA)

[50] presents a scheduling strategy for tasks competing on multiple cores under timing and cache allocation constraints. The advantage of this method is that it allows each task to use a fixed number of cache partitions, and a cache partition is used by only one task at a given time. Therefore, the cache spaces allocated to tasks are isolated from one another. The authors assume that there exist cache partitioning algorithms that can divide the shared cache space into non-overlapping partitions for independent use by tasks, as shown in Figure 3.7 (a).

Figure 3.7: Cache space isolation and page coloring

The task model is as follows. A task is defined as $\tau_i = (A_i, C_i, D_i, T_i)$, where A_i is the cache space size, C_i is the worst-case execution time (WCET), D_i <= T_i is the relative deadline for each release, and T_i is the minimum inter-arrival separation time, also referred to as the period of the task. The authors assume that the tasks are ordered by priority. The utilization of a task is defined as $U_i = C_i / T_i$ and the slack is defined as $S_i = D_i - C_i$. Slack is the maximum delay allowed before a deadline is missed.

The authors focus on a much simpler non-preemptive fixed-priority scheduling algorithm (FPCA), because it is difficult to predict the overhead that preemption imposes on each task. The scheduling algorithm is triggered by a job completion or a job arrival. Provided that enough resources are available, the highest-priority jobs are scheduled for execution. More specifically, a job Ji is scheduled for execution if the following conditions are true:

1. Ji is the job of highest priority among all waiting jobs,

2. There is at least one core idle, and

3. Enough cache partitions, i.e. at least Ai, are idle.

Because of the assumption D_i <= T_i, there is at most one job of each task at any time.

Figure 3.8: Example for illustrating scheduling algorithm

An example task set is shown in Figure 3.8. Table 1 in the figure shows the tasks scheduled by FPCA. At time 0, a lower-priority job cannot execute because of the constraints of FPCA: even though it is ready and an idle cache partition is available, a higher-priority job is waiting to execute, and there are not enough idle cache partitions for that higher-priority job. This is a limitation of the algorithm: a low-priority job is not scheduled, even when enough resources are available for it, while a higher-priority job has to wait, which wastes valuable resources. This type of scheduling is called blocking-style scheduling. In contrast, a scheduling policy that allows a ready low-priority job to execute ahead of a ready high-priority job that lacks sufficient idle cache space is called non-blocking-style scheduling.
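The blocking-style dispatch rule can be sketched as a small function. This is a simplified illustration under stated assumptions: the `Job` fields, the convention that a smaller number means a higher priority, and the function name `dispatch_blocking` are all hypothetical and are not taken from [50].

```python
from dataclasses import dataclass

@dataclass
class Job:
    name: str
    priority: int           # assumed convention: smaller value = higher priority
    cache_partitions: int   # number of cache partitions the job needs

def dispatch_blocking(waiting, idle_cores, idle_partitions):
    """Return the jobs that start now under blocking-style scheduling.

    Jobs are considered in priority order. As soon as the highest-priority
    waiting job cannot start, every lower-priority job is blocked as well,
    even if the remaining resources would be enough for it.
    """
    started = []
    for job in sorted(waiting, key=lambda j: j.priority):
        if idle_cores < 1 or job.cache_partitions > idle_partitions:
            break
        started.append(job)
        idle_cores -= 1
        idle_partitions -= job.cache_partitions
    return started
```

With two idle cores and two idle partitions, a waiting high-priority job that needs three partitions blocks a low-priority job that needs only one, which is exactly the wasted-resource situation described above. A non-blocking variant would replace the `break` with `continue`.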

(ii) Cache Aware Multicore Real-Time Scheduler

[51] describes a cache-aware soft real-time scheduler that reduces the cache miss rate. The authors assume that the system is modeled as a set of multi-threaded tasks (MTTs). Each MTT consists of a set of sequential tasks that share a common period but may have different execution costs. MTTs are used to specify groups of cooperating tasks that reference a common set of data. MTTs allow concurrency within task models that typically handle only the sequential execution of tasks. The processing power of each core is likely to remain the same as per-chip core counts increase, so MTTs should be useful for achieving performance gains. The authors use G-EDF scheduling as a baseline for evaluating the performance of the proposed cache-aware scheduler. In G-EDF scheduling, jobs are scheduled in order of increasing deadlines, with ties broken arbitrarily. G-EDF is not an optimal scheduling policy, so tasks may miss their deadlines. Research has shown, however, that the resulting tardiness is bounded. G-EDF is used as the scheduling heuristic in this system. Here, the per-job working set size (WSS) of each MTT indicates its cache impact.

The authors also provide a profiler that gives the per-job WSS for each MTT, which the scheduling heuristic then uses. MTTs are profiled rather than individual tasks because the tasks of an MTT share a common working set. The profiling happens online while jobs are executing. The profiler uses performance counters to record information about each MTT. Each core is associated with a set of performance counters that can be used to monitor events originating from that core. The authors set each counter to track shared cache misses. Jobs are executed sequentially, which allows the number of cache misses to be tracked by resetting the counters to zero at the beginning of execution. The misses are read at the end of execution of the job sequence. It is these misses that the authors use to calculate the per-job WSS.

(iii) Symbiotic Job Scheduler (SOS)

[52] introduces a job scheduler called SOS for Simultaneous Multi-Threading (SMT) architectures. This work is based on the observation that performance on a hardware SMT processor is sensitive to the set of jobs that the operating system scheduler co-schedules, and that to benefit from an SMT environment the scheduler should be able to identify interactions between competing threads. The SOS scheduler has been shown experimentally to improve the performance of SMT architectures significantly. One significant advantage of this method is that SOS does not assume any prior knowledge of workload characteristics. Instead, sampling techniques are used to identify threads that minimize contention for shared resources.

The SOS scheduler runs jobs in groups whose size equals the SMT level. The jobs are grouped according to a fair policy that allows all jobs to make progress. The SOS scheduler first enters a sample phase, in which it periodically permutes the schedule and thereby changes which jobs are co-scheduled in each group. During this sample phase, the SOS scheduler collects dynamic execution profiles of the executing jobs using the hardware performance counters. After sampling the performance of several schedule permutations, SOS selects the schedule predicted to be best and runs it until the jobs are completed.

The authors also define a measure of the goodness (i.e. speedup) of a co-schedule. Intuitively, if one job schedule executes more useful instructions than another during the same time interval, the first schedule is deemed more symbiotic and shows higher speedup. This suggests that IPC (instructions per cycle) is a good measure of speedup. The problem with this approach, however, is that an unfair schedule can show good speedup, at least for a while, by favoring high-IPC threads. Therefore, to ensure that SOS measures real increases in the rate of progress through the entire job mix, the authors define the following measure:

$$WS(t) = \sum_{i} \frac{\text{realized IPC of job } i}{\text{single-threaded IPC of job } i}$$

where each term of WS(t) is the contribution of one thread to the total work completed in the interval, obtained by dividing the instructions executed on that job's behalf by its natural offer rate if run alone.
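The difference between raw IPC and a weighted measure can be shown with a small calculation. The sketch below assumes per-job IPC values are available as dictionaries; the function name `weighted_speedup` and all numbers are hypothetical and chosen only to illustrate the point about unfair schedules.

```python
def weighted_speedup(realized_ipc, solo_ipc):
    """Sum, over all jobs, of co-scheduled IPC divided by run-alone IPC."""
    return sum(realized_ipc[job] / solo_ipc[job] for job in realized_ipc)

# Assumed run-alone IPC of two jobs.
solo = {"a": 2.0, "b": 1.0}

# A fair co-schedule: both jobs run at half their solo rate.
fair = {"a": 1.0, "b": 0.5}

# An unfair co-schedule that favors the high-IPC job and starves the other.
unfair = {"a": 1.8, "b": 0.05}

raw_fair = sum(fair.values())
raw_unfair = sum(unfair.values())
ws_fair = weighted_speedup(fair, solo)
ws_unfair = weighted_speedup(unfair, solo)
```

In this example the unfair schedule has the higher raw IPC, yet its weighted value is lower than that of the fair schedule, because starving the low-IPC job is penalized once each job is normalized by its own run-alone rate.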

3.3. Power Management in Multicores

Today, many embedded systems such as mobile phones and PDAs are designed using complex multicore SoC platforms. For such multicore platforms, suitable power management techniques need to be devised.

The simplest method of managing power on a multicore chip is to apply well-known single-core techniques to every core. This approach is inefficient, however, because it cannot take advantage of the peak power averaging effects that occur across multiple cores. Much research has therefore been conducted in this area [53, 54, 55] to provide improved power management techniques. Here we discuss some of the important techniques for power management in multicores.

Figure 3.9: Real temperature of one core on running bzip2 benchmark [53]

(i) Predictive Dynamic Thermal Management for Multicore Systems (PDTM)

[53] describes Predictive Dynamic Thermal Management (PDTM), which is based on an Application-based Thermal Model (ABTM) and a Core-based Thermal Model (CBTM) for multicore systems. For each core, PDTM uses a temperature prediction model that makes use of both the core temperature and the application's temperature variations. Based on the predictions, appropriate actions are taken to avoid thermal emergencies. In this work, the prediction model is called DBTM. ABTM predicts future temperature using the thermal behaviour of the application. CBTM, on the other hand, uses the steady-state temperature and the workload to estimate the core temperature pattern. The authors have implemented this temperature prediction model, together with the thermal-aware scheduling mechanism, on a real four-core product under a Linux environment. Experimental results on an Intel quad-core system running two SPEC2006 benchmarks simultaneously show that the proposed PDTM lowers temperature by about 5% on average and reduces peak temperature by up to 3%, with at most 8% performance overhead.

Figure 3.9 depicts the rapid temperature changes that occur even when the workload is constant at 100%. To predict future temperature at fine granularity, ABTM uses short-term thermal behavior. ABTM first derives the thermal behavior from local intervals (i.e. short-term temperature reactions) and then predicts the future temperature by incorporating this behavior into a regression-based approach known as the Recursive Least Squares method. In the general least-squares problem, the output y of a linear model is given by the linearly parameterized expression:

$$y = \theta_1 f_1(u) + \theta_2 f_2(u) + \cdots + \theta_n f_n(u)$$

where u = [u1, u2, ..., un] is the model's input vector, f1, ..., fn are known functions of u, and θ1, θ2, ..., θn are unknown parameters to be estimated. Using this equation, ABTM can predict the future temperature for an application, as shown in Figure 3.10.
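To show how such a linear model can be fitted one sample at a time, the following is a generic textbook recursive least squares sketch in plain Python. It is not the ABTM implementation of [53]: the function name `rls_fit`, the initial covariance `delta`, the forgetting factor, and the synthetic temperature trend in the usage note are all assumptions made for illustration.

```python
def rls_fit(samples, n_params, forgetting=1.0, delta=1e6):
    """Estimate theta in y = theta . phi by recursive least squares.

    samples is an iterable of (phi, y) pairs, where phi is the list
    [f_1(u), ..., f_n(u)] evaluated for one observation. The matrix P
    starts as delta * I (a large value means little prior confidence).
    """
    theta = [0.0] * n_params
    P = [[delta if i == j else 0.0 for j in range(n_params)]
         for i in range(n_params)]
    for phi, y in samples:
        # P_phi = P * phi; P stays symmetric, so phi^T P is its transpose.
        P_phi = [sum(P[i][j] * phi[j] for j in range(n_params))
                 for i in range(n_params)]
        denom = forgetting + sum(phi[i] * P_phi[i] for i in range(n_params))
        gain = [value / denom for value in P_phi]
        error = y - sum(phi[i] * theta[i] for i in range(n_params))
        theta = [theta[i] + gain[i] * error for i in range(n_params)]
        P = [[(P[i][j] - gain[i] * P_phi[j]) / forgetting
              for j in range(n_params)] for i in range(n_params)]
    return theta

def predict(theta, phi):
    """Evaluate the fitted linear model for one feature vector."""
    return sum(t * f for t, f in zip(theta, phi))
```

As a usage example, feeding the function noise-free samples of an assumed trend such as temperature = 50 + 0.5 t, with features `[1.0, t]`, recovers parameters close to 50 and 0.5, and `predict` can then extrapolate the temperature at a future time step.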


(ii) Dynamic Multicore Power Management (DPM)

To address the problem of dynamic power management (DPM), the authors of [54] propose a formal verification model for DPM schemes. The model incorporates probabilistic model checking to estimate the required verification effort. The model checker can provide information on how certain design parameters affect this effort.

Figure 3.11 depicts the current industrial workflow for developing a DPM scheme. The unshaded portion shows the development of a new DPM scheme. This workflow, however, is prone to missing bugs. The new method addresses this concern by introducing an additional, early step in the development of a new DPM scheme. The shaded portion of Figure 3.11 shows this added step. The purpose of the additional step is to create a high-level model of the proposed power management policy at an early design stage. The probabilistic model checker is then used to verify this high-level model for efficiency and safety. Probabilistic model checking is an exhaustive formal verification method. The advantage of performing a high-level verification early in the development process is that problems can be identified at a stage when they are easier to solve. In addition, a high-level model is much easier to develop and modify than a detailed simulator, so various designs can be verified quickly.


The effort required to verify the DPM scheme is measured as the number of reachable states and transitions, which is estimated using the model checker. The model checker also provides a better understanding of how scaling certain design parameters affects the verification effort. It should be noted that model checking does not eliminate the need to simulate a detailed implementation of the DPM scheme later. However, it can catch bugs early and help the simulation reach the desired state coverage goals.


4. PARALLELIZATION WITH MULTICORES

4.1. Background

Parallelization is the most significant feature of multicores and is also the motivation behind the development of multicore processors. In the evolution of high-performance multicore processors, multithreaded CPUs were the stepping stone. In a multithreaded system, hardware-level context switching between threads is used to reduce processor idle time. Shortly afterwards, designers integrated more than one processor core onto a single chip. Eight-core processors are common these days, and CPUs with more and more cores are forecast to become available in the near future. Assuming that Moore's Law holds, we expect the number of cores on a chip to double every two years, leading to many-core CPUs (16 or more cores) just over the horizon.

Multithreaded and multicore CPUs both exploit concurrency by executing multiple threads, though their designs target different objectives. Multithreaded CPUs support concurrent thread execution by issuing instructions from multiple threads. Multicore CPUs achieve thread concurrency by replicating cores, which improves scalability. These CPUs are often called CMPs, which stands for Chip Multi-Processors.

Most recent CPU and GPU (Graphics Processing Unit) designs, such as the Sun UltraSPARC T2, IBM POWER6, Intel Xeon, ATI RV770, and NVIDIA GT200, combine both of these design options and have multiple multithreaded cores [24].

4.2. Design spectrum of parallelization

The performance of these parallel systems depends on a wide spectrum of design choices, as described below [24]:

• Multithreaded Cores

All multithreaded cores keep multiple hardware threads “on-chip”, ready for execution. This is necessary to minimize context-switching cost. Each on-chip thread needs its own state components, such as the instruction pointer and other control registers. Thus, the number of on-chip threads determines the number of state components that must be replicated and, subsequently, the maximum degree of hardware-supported concurrency.

There exist a variety of approaches to switching between threads on a core, ranging from alternating between threads to issuing instructions from several threads in each cycle. The former is called Temporal Multi-Threading (TMT). Most current CPUs employ the latter approach, which is called Simultaneous Multi-Threading (SMT); one of the most common examples of SMT is Intel's Hyper-Threading Technology (HTT).
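The difference between the two switching approaches can be sketched with a toy cycle-count model. It is a deliberate simplification rather than a description of any real core: the parameters `issue_width` (issue slots per cycle) and `ilp` (the most instructions a single thread can supply per cycle) are assumptions, both must be at least 1, and latencies, dependencies, and cache effects are ignored.

```python
def cycles_temporal(work, issue_width, ilp):
    """Temporal multithreading: only one thread issues in any given cycle.

    `work` lists the number of instructions each hardware thread must issue.
    """
    remaining = list(work)
    cycles = 0
    turn = 0
    while any(remaining):
        # Round-robin to the next thread that still has instructions left.
        while remaining[turn % len(remaining)] == 0:
            turn += 1
        t = turn % len(remaining)
        remaining[t] -= min(ilp, issue_width, remaining[t])
        turn += 1
        cycles += 1
    return cycles


def cycles_simultaneous(work, issue_width, ilp):
    """Simultaneous multithreading: one cycle's issue slots may be filled
    with instructions from several threads."""
    remaining = list(work)
    cycles = 0
    while any(remaining):
        slots = issue_width
        for t in range(len(remaining)):
            issued = min(ilp, slots, remaining[t])
            remaining[t] -= issued
            slots -= issued
        cycles += 1
    return cycles
```

With two threads of four instructions each, an issue width of 4, and a per-thread limit of 2, the temporal model needs 4 cycles while the simultaneous model needs 2. When the issue width is reduced to 2, both need 4 cycles, which reflects that in this model SMT only helps when a single thread cannot fill all issue slots.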

• Multicore CPUs

Hardware multithreading per core has limited scalability, whereas multicore CPUs scale more readily. Most early multicore chips were constructed as a simple pairing of existing single-core chips, as in the Itanium dual-core [24]. Like their predecessors, these chips replicate only the control and execution units and share the remaining units per chip. However, sharing leads to contention on the shared resources. The current trend is toward replicating more components, such as memory controllers and caches, to reduce contention and communication overhead.

• Integration of on-chip components

The number and selection of components integrated on chip is an important design decision. Possible components to include on-chip are memory controllers, communication interfaces, and memory. Placing the memory controller on-chip increases bandwidth and decreases latency. Some designs integrate multiple memory controllers so that memory-access bandwidth scales with the number of cores. Integrating a GPU core on chip is another promising technique and might become common in next-generation multicores. IBM’s Blue Gene/P [25] system relies on a highly integrated system-on-a-chip design which features four cores, five network interfaces, two memory controllers, and 8 MB of L3 cache. Because each Blue Gene compute node is so highly integrated, the system scales to hundreds of thousands of processors.

• Shared vs. Private Caches

Aside from concurrency, caches are the most important performance-enhancing feature of modern CPUs, due to the gap between CPU speed and memory-access times. For multicore processors, the organization of cache memory also plays an important role. Most current multicore chip designs have a private L1 cache per core to reduce contention for this critical cache level. The assignment of L2 cache in multicore designs varies. L3 cache was historically off-chip and shared, with some exceptions. Whether