

2.1 Hardware Foundations

2.1.1 Multiprocessors and Multicore Platforms

The term “multiprocessor” encompasses a wide range of system architectures. This dissertation applies to shared-memory, uniform memory access, identical multiprocessors, that is, systems in which task migrations are conceptually viable and for which the choice of scheduler is not predetermined. In the following, we briefly discuss the meaning and significance of the attributes “shared-memory,” “uniform memory access,” and “identical.”

A multiprocessor consists of multiple, independently controlled processing units that communicate via a processor interconnect (Tanenbaum, 2005). There are two fundamental classes of multiprocessors that differ in the nature of the interconnect. In a shared-memory multiprocessor, there is a central memory that is accessible to all processors, and processors are connected to each other and the central memory by means of a shared memory bus. In contrast, in a distributed-memory system (or multicomputer), there are multiple local memories that are accessible to only a subset of the processors (historically one). Processors are still connected to each other in a distributed-memory system, but only via a message bus that does not allow direct access to non-local memory. Figure 2.1 illustrates the difference on a conceptual level. In practice, interconnect implementations are usually much more refined than a simple “bus” since they are crucial to a multiprocessor’s effective speed and capacity; Baer (2010) provides a detailed introduction to multiprocessor interconnects and Bjerregaard and Mahadevan (2006) survey high-performance, on-chip switched networks.

[Figure 2.1 shows two panels. Shared-memory multiprocessor: Proc. 1 to Proc. 3 attached to a shared memory via a memory bus. Distributed-memory multiprocessor: Proc. 1 to Proc. 3, each with its own local memory, connected via a message bus.]

Figure 2.1: Illustration of shared- and distributed-memory multiprocessors. In a shared-memory system, a memory bus enables access to a shared central memory, whereas processors communicate via a message bus that does not allow remote memory access in a distributed-memory system. This dissertation pertains to shared-memory multiprocessors.

From a scheduling point of view, the main difference between shared- and distributed-memory architectures is how process migration is implemented, i.e., how a process may commence execution on one processor and, transparently to the process, continue execution on another processor. When a process is migrated in a shared-memory system, only its hardware state (such as register contents) must be transferred since its data (including its OS state) is accessible from all processors. In contrast, the data of a migrating process must be copied from one local memory to another in a distributed-memory system. Copying process state is challenging to implement at the OS level and can create considerable load on the communication bus (see Milojičić et al. (2000) for an overview of process migration techniques and challenges). Consequently, frequent process migrations are impractical in distributed-memory systems, which precludes clustered and global scheduling policies.
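The difference in migration cost can be made concrete with a small model. The following Python sketch is purely illustrative and not taken from the dissertation: the `Process` descriptor, the function `migration_bytes`, and all byte counts are assumptions chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class Process:
    """Hypothetical process descriptor; the sizes are illustrative only."""
    hw_state_bytes: int  # register contents and similar hardware state
    data_bytes: int      # process data, including its OS-level state


def migration_bytes(proc: Process, shared_memory: bool) -> int:
    """Return how many bytes must be transferred to migrate `proc`.

    Shared memory: only the hardware state moves, because the data
    remains accessible from every processor.
    Distributed memory: the data must also be copied from one local
    memory to another.
    """
    if shared_memory:
        return proc.hw_state_bytes
    return proc.hw_state_bytes + proc.data_bytes


if __name__ == "__main__":
    p = Process(hw_state_bytes=512, data_bytes=8 * 1024 * 1024)
    print(migration_bytes(p, shared_memory=True))   # prints 512
    print(migration_bytes(p, shared_memory=False))  # prints 8389120
```

Under these assumed sizes, a migration in the distributed-memory case moves several orders of magnitude more data, which is why frequent migrations are impractical there.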

Shared-memory systems further differ with regard to memory access times. Ideally, a processor should be able to access each memory location with the same maximum latency. Such systems are called uniform memory access (UMA) architectures. However, completely centralized memory systems can quickly become a bottleneck in practice because only one or a few processors can access the central memory at a time. Hence, the available memory may be split across several modules, which may result in a non-uniform memory access (NUMA) architecture where some memory modules are closer to a particular processor than others. To achieve maximum efficiency in NUMA systems, a task should be scheduled on a processor “close” to its data, which precludes global scheduling.

This dissertation applies mainly to UMA multiprocessors; the impact of accommodating NUMA constraints is briefly discussed in Chapter 8.
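The UMA/NUMA distinction can be sketched as a property of a latency matrix. In the Python sketch below, `latency[p][m]` is assumed to hold the maximum access latency from processor `p` to memory module `m`; the matrix layout, the function names, and the latency values are hypothetical and serve only to illustrate the definitions above.

```python
from typing import List, Sequence


def is_uma(latency: Sequence[Sequence[int]]) -> bool:
    """True if every processor reaches every module with the same latency."""
    values = {cell for row in latency for cell in row}
    return len(values) <= 1


def closest_processors(latency: Sequence[Sequence[int]], module: int) -> List[int]:
    """Return the processors with the lowest latency to `module`."""
    column = [row[module] for row in latency]
    best = min(column)
    return [p for p, lat in enumerate(column) if lat == best]


if __name__ == "__main__":
    uma = [[100, 100], [100, 100]]   # two processors, two modules
    numa = [[80, 140], [140, 80]]    # each processor has a "near" module
    print(is_uma(uma), is_uma(numa))            # prints True False
    print(closest_processors(numa, module=1))   # prints [1]
```

In the NUMA example, a task whose data resides in module 1 is best placed on processor 1, which illustrates why NUMA constraints conflict with global scheduling.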

Processor symmetry. In the scheduling literature, shared-memory multiprocessors are further classified based on the capabilities of their constituent processors (e.g., see Brucker, 2007; Funk, 2004). In identical multiprocessors, there are no differences among processors, that is, a task’s execution is not affected by the processor’s identity. Identical multiprocessors are also commonly referred to as symmetric multiprocessors (SMPs). In uniform heterogeneous multiprocessors, processors differ in speed but have otherwise equal capabilities. In this case, the execution requirement of a task assigned to a slower processor is scaled up in inverse proportion to that processor’s speed (e.g., a task takes twice as long on a processor running at half speed). A widespread example of uniform heterogeneous multiprocessors is systems that support per-processor frequency scaling as a power-saving measure. Finally, in unrelated heterogeneous multiprocessors, each processor may have special capabilities (such as application-specific co-processors) such that execution time requirements are both processor- and task-specific. The three classes of multiprocessors are illustrated in Figure 2.2.

We restrict our focus to identical multiprocessors in the main part of this dissertation and briefly consider how our results apply to the heterogeneous cases in Chapter 8.
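The three classes can be told apart by how execution rates depend on the task and the processor. The sketch below assumes a hypothetical matrix `rate[task][processor]` giving the speed at which each task executes on each processor; this representation and the names `classify` and `execution_time` are illustrative assumptions, not notation used in this dissertation.

```python
from typing import Sequence


def classify(rate: Sequence[Sequence[float]]) -> str:
    """Classify a platform from rate[task][processor] execution rates."""
    columns = list(zip(*rate))  # one column of rates per processor
    # If some processor's rate depends on the task, the platform is unrelated.
    if not all(len(set(col)) == 1 for col in columns):
        return "unrelated heterogeneous"
    # Otherwise each processor has a single speed; compare across processors.
    speeds = {col[0] for col in columns}
    return "identical" if len(speeds) == 1 else "uniform heterogeneous"


def execution_time(requirement: float, speed: float) -> float:
    """Time needed to complete `requirement` units of work at `speed`."""
    return requirement / speed


if __name__ == "__main__":
    print(classify([[2.0, 2.0, 2.0], [2.0, 2.0, 2.0]]))  # identical
    print(classify([[2.0, 1.0, 0.5], [2.0, 1.0, 0.5]]))  # uniform heterogeneous
    print(classify([[1.0, 3.0, 0.5], [2.0, 1.0, 0.0]]))  # unrelated heterogeneous
    print(execution_time(10.0, 2.0))  # prints 5.0
    print(execution_time(10.0, 0.5))  # prints 20.0
```

The last two lines show the scaling in the uniform case: the same requirement takes four times as long on a processor that runs at a quarter of the speed.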

[Figure 2.2 shows three panels labeled Identical, Uniform Heterogeneous, and Unrelated Heterogeneous, each with Proc. 1 to Proc. 3 annotated with clock speed and capabilities.]

Figure 2.2: Illustration of identical, uniform heterogeneous, and unrelated heterogeneous multiprocessors. No differences exist among processors in identical multiprocessors, e.g., here, each processor has a speed of 2 GHz and a floating point unit (FPU). Processors differ in speed but not capabilities in uniform heterogeneous multiprocessors, e.g., each processor has an FPU but is clocked at a different speed. Processors differ in capabilities (and possibly also speed) in unrelated heterogeneous multiprocessors, e.g., only one processor has an FPU whereas the others are targeted at fast integer arithmetic and slow I/O processing. The focus of this dissertation is identical multiprocessors.

Multicore vs. multithreading vs. multiprocessor. As mentioned in the introduction, interest in multiprocessor real-time scheduling is driven in large part by the widespread emergence of multicore platforms. In a multicore design, multiple (mostly) independent processing cores are manufactured on a single integrated circuit chip to exploit increases in transistor density (Olukotun et al., 1996; Sodan et al., 2010). Such systems are sometimes referred to as a multiprocessor on a chip (Tanenbaum, 2005). From a scheduling point of view, “multicore” is thus simply a particular way of implementing multiprocessors.

Another technique similarly used to exploit improved transistor densities for increased parallelism is hardware multithreading, where some of a core’s functional units (such as the instruction pipeline) are replicated such that the core appears as multiple “processors” to the operating system. Multithreading allows multiple task contexts to appear to be scheduled “concurrently,” when in fact the core alternates rapidly among hardware threads. This can improve overall throughput by increasing a core’s utilization because a core may quickly switch hardware threads whenever the currently executing thread stalls (e.g., due to a cache miss; see Section 2.1.2 below). However, from a real-time perspective, hardware multithreading can be problematic because it can introduce hard-to-predict execution-time variations (Barre et al., 2008; Jain et al., 2002).
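Both effects, higher throughput and less predictable per-thread timing, can be seen in a toy model. The sketch below is a deliberate simplification and does not describe any real core: each hardware thread is given as a hypothetical trace string in which `c` is one compute cycle and `.` is one stalled cycle, stalls of different threads are assumed to overlap freely, and the core is assumed to issue at most one compute cycle per clock cycle, preferring the lowest-numbered ready thread.

```python
from typing import List


def simulate_multithreaded(threads: List[str]) -> List[int]:
    """Return the cycle in which each hardware thread finishes its trace."""
    pos = [0] * len(threads)      # next position in each trace
    finish = [0] * len(threads)   # completion cycle of each thread
    cycle = 0
    while any(p < len(t) for p, t in zip(pos, threads)):
        cycle += 1
        issued = False            # the core computes for one thread per cycle
        for i, trace in enumerate(threads):
            if pos[i] >= len(trace):
                continue          # thread already finished
            if trace[pos[i]] == ".":
                pos[i] += 1       # stall cycles elapse in the background
            elif not issued:
                pos[i] += 1       # this thread gets the core this cycle
                issued = True
            else:
                continue          # ready, but the core is busy
            if pos[i] == len(trace):
                finish[i] = cycle
    return finish


if __name__ == "__main__":
    traces = ["cc..cc", "cc..cc"]
    print(sum(len(t) for t in traces))       # prints 12 (run back to back)
    print(simulate_multithreaded(traces))    # prints [6, 8]
    print(simulate_multithreaded(["cc..cc"]))  # prints [6]
```

In this model, both threads complete within 8 cycles instead of 12, but the second thread needs 8 cycles rather than the 6 it takes when running alone, which is the kind of interference-dependent execution-time variation that complicates real-time analysis.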

The terms “chip,” “CPU,” “core,” and “processor” are used interchangeably in many documents. To avoid confusion, we adopt the following nomenclature in this dissertation. A computer contains one or more processing chips. Each chip consists of one or more cores that perform the actual computations. Each core makes one or more hardware threads available to the operating system. For scheduling purposes, every hardware thread/context that can be independently scheduled by the operating system is a processor.
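This nomenclature maps directly onto a nested data structure. The following sketch, with hypothetical class names `Core`, `Chip`, and `Computer`, counts processors in the sense defined above; the second example configuration is an assumption chosen for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Core:
    hardware_threads: int = 1  # contexts exposed to the operating system


@dataclass
class Chip:
    cores: List[Core]


@dataclass
class Computer:
    chips: List[Chip]

    def processors(self) -> int:
        """Count every independently schedulable hardware thread."""
        return sum(core.hardware_threads
                   for chip in self.chips
                   for core in chip.cores)


if __name__ == "__main__":
    # Four chips with six single-threaded cores each.
    machine = Computer([Chip([Core() for _ in range(6)]) for _ in range(4)])
    print(machine.processors())  # prints 24

    # Hypothetical single chip with two cores and two hardware threads per core.
    small = Computer([Chip([Core(hardware_threads=2) for _ in range(2)])])
    print(small.processors())    # prints 4
```

The first configuration corresponds to a 24-processor machine built from four six-core chips without hardware multithreading; with multithreading enabled, the processor count exceeds the core count.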

Xeon L7455. As mentioned above, our experimental platform is a 24-core 64-bit UMA machine, which consists of four physical Intel Xeon L7455 chips (Intel Corporation, 2008b). Each chip contains six cores running at 2.13 GHz. Intel systems support a variety of frequency scaling and power-saving techniques. Due to our focus on identical multiprocessors, we do not use any of these techniques and keep each core clocked at full speed at all times. This generation of Intel’s Xeon family does not support “hyperthreading,” Intel’s implementation of hardware multithreading. Each chip contains two levels of shared caches, which we discuss next.