Multicore Processors - Adaptive virtual machine scheduling and migration for embedded real-time

The Central Processing Unit (CPU) is the component of a computer system that executes program instructions. After fetching the instructions from main memory, the CPU examines and executes them. A multicore processor is a processor with two or more CPUs, typically integrated in a single integrated circuit (chip multiprocessor). These so-called cores execute instructions in parallel, independently of each other (processor-level parallelism). Different cores can execute different instructions on different parts of the memory at the same time (multiple instructions, multiple data), in contrast to array processors, which perform the same sequence of instructions on multiple instances of data. The ISA is in general the same as for single-core processors, except for modifications to support parallelism, since this enables the reuse of existing software and development tools [Tanenbaum and Goodman, 1998]. Homogeneous multicore processors feature identical cores (same ISA and frequency). Heterogeneous architectures combine different processing elements, for example a digital signal processor and a general purpose processor [Catanzaro, 1994].

It is becoming increasingly difficult for processor designers to achieve the demanded performance growth by increasing the frequency of single-core processors, since the associated growth in power consumption and heat dissipation becomes unacceptable [Keckler et al., 2009]. Multicore processors address the power issue and achieve performance growth through parallelism, i.e., by increasing the number of processor cores instead of increasing the frequency of a single core. Executing on multiple cores at a lower frequency increases the performance in terms of instructions per second while reducing power consumption.
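The trade-off can be illustrated with a rough calculation. The sketch below assumes a common first-order model in which dynamic power scales with V^2 * f and the supply voltage scales roughly linearly with frequency, so that power per core grows with the cube of the frequency. It also assumes a perfectly parallelizable workload. Neither assumption is taken from the text above, and the function names are illustrative only.

```python
def relative_power(num_cores: int, freq_scale: float) -> float:
    """Power relative to one core running at full frequency.

    Assumption (not from the text): dynamic power ~ V^2 * f and V ~ f,
    hence power per core ~ freq_scale ** 3.
    """
    return num_cores * freq_scale ** 3


def relative_throughput(num_cores: int, freq_scale: float) -> float:
    """Instructions per second relative to one core at full frequency.

    Assumption: the workload parallelizes perfectly across all cores.
    """
    return num_cores * freq_scale


if __name__ == "__main__":
    # Compare a single full-speed core with slower multicore configurations.
    for cores, scale in [(1, 1.0), (2, 0.6), (4, 0.5)]:
        print(cores, scale,
              relative_throughput(cores, scale),
              relative_power(cores, scale))
```

Under these assumptions, two cores at 60 % of the original frequency deliver more throughput than one full-speed core at less than half the power. Real workloads rarely parallelize perfectly, so the gain should be read as an upper bound.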

The cores share main memory and peripheral devices. The main memory is shared uniformly: the latency of an access to a specific memory location is the same for all cores. A problem of shared-memory multiprocessing is memory bus contention when the cores try to access the memory over the same bus. The resulting performance degradation can be reduced by caches, significantly faster but smaller memories that buffer data and instructions between CPU and memory [Tanenbaum and Goodman, 1998]. A multi-level cache hierarchy may be present and caches can be core-exclusive or shared. All components (cores, main memory, caches, I/O devices) are connected by a bus. Figure 2.6 shows such a bus-based shared memory multicore processor with per-core private caches.

Figure 2.6: Single-bus shared memory multicore with private caches (main memory, I/O devices, and two cores, each with a private cache, connected by a bus)

Private caches raise the challenge of cache coherence. If copies of the same memory block are stored in multiple caches, problems may arise with inconsistent shared data. If a core modifies the memory block in main memory, the other cores continue to access the no longer valid copy in their caches [Culler et al., 1999]. Common mechanisms to ensure coherency are directory-based and snooping protocols. In the first case, a directory stores the information about which data is shared between caches and filters all requests of the cores to load or update memory in their caches. When an entry is modified, the directory either updates or invalidates the copies in other caches [Moyer, 2013]. In the case of snooping protocols, the caches monitor a shared bus for accesses from other caches to memory locations of which they hold copies. When a write operation to such a location is observed, the cache controller invalidates its own copy. Another possibility is to broadcast all updates to shared data and update the affected caches [Culler et al., 1999, Moyer, 2013].
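The invalidation variant of snooping can be sketched in a few lines. The model below is a deliberate simplification and not a real protocol such as MESI: it assumes write-through caches, a single shared bus, and no cache line states. The class and method names (Bus, Cache, snoop_write) are hypothetical.

```python
class Bus:
    """Shared bus that broadcasts every write to all attached caches."""

    def __init__(self, memory):
        self.memory = memory   # main memory: addr -> value
        self.caches = []

    def broadcast_write(self, writer, addr):
        # Every cache except the writer observes (snoops) the write.
        for cache in self.caches:
            if cache is not writer:
                cache.snoop_write(addr)


class Cache:
    """Private write-through cache that invalidates on observed writes."""

    def __init__(self, bus):
        self.lines = {}        # addr -> cached value
        self.bus = bus
        bus.caches.append(self)

    def read(self, addr):
        if addr not in self.lines:             # miss: fetch from main memory
            self.lines[addr] = self.bus.memory[addr]
        return self.lines[addr]

    def write(self, addr, value):
        self.lines[addr] = value
        self.bus.memory[addr] = value          # write-through to main memory
        self.bus.broadcast_write(self, addr)   # let the other caches snoop

    def snoop_write(self, addr):
        self.lines.pop(addr, None)             # invalidate a stale copy, if any


if __name__ == "__main__":
    memory = {0x10: 1}
    bus = Bus(memory)
    core1_cache, core2_cache = Cache(bus), Cache(bus)
    core1_cache.read(0x10)
    core2_cache.read(0x10)         # both caches now hold a copy
    core1_cache.write(0x10, 2)     # invalidates the copy of core 2
    print(core2_cache.read(0x10))  # misses and reloads the new value: 2
```

Without the call to snoop_write, the second cache would keep returning the old value 1, which is exactly the inconsistency described above.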

In Symmetric Multiprocessing (SMP), a single OS controls the software execution on all cores. All cores share the code and data of the OS and execute both the OS (potentially simultaneously) and application tasks. Asymmetric Multiprocessing (AMP) uses a separate OS instance on each core. Hypervisor-based system virtualization has to be considered as a third approach. The hypervisor itself controls the software execution on all cores in an SMP manner, and its code and data are shared among multiple cores. However, it manages the execution of different operating systems, which operate independently on different cores, as is the case for AMP [Moyer, 2013]. Figure 2.7 illustrates these different approaches to using a multicore processor.

The hypervisor might assign each VM to a single core and make the multicore processor look like a single core to the guest, or it might support multicore operating systems by enabling execution on multiple cores in parallel. Moreover, VMs might be pinned to a certain core (full core affinity) or be scheduled by the hypervisor among the cores at runtime (no affinity).

Figure 2.7: System software’s core management: symmetric multiprocessing (one OS spanning both cores), asymmetric multiprocessing (one OS per core), and hypervisor-based virtualization (a hypervisor spanning both cores and hosting multiple OSes)

2.3.1 Multicore Scheduling

So far, scheduling has been defined only for uniprocessors (see Definition 4 in Section 2.1.2). As just introduced, a processor might feature multiple cores, each a full CPU. A multicore scheduler has to make an additional decision, the so-called allocation problem: not only which task to execute at any point in time, but also on which core. The first work on multiprocessor real-time scheduling dates back to the late 1960s, when Liu exposed the complexity of the problem [Davis and Burns, 2010]:

Few of the results obtained for a single processor generalize directly to the multiple processor case; bringing in additional processors adds a new dimension to the scheduling problem. The simple fact that a task can use only one processor even when several processors are free at the same time adds a surprising amount of difficulty to the scheduling of multiple processors. [Liu, 1969]

Multiprocessor scheduling algorithms can be classified according to when the allocation is made (migration-based classification [Carpenter et al., 2004]). In partitioned scheduling, each task is allocated to a specific processor core and executed only on this core (no migration). The scheduler maintains a separate ready queue per core.

Scheduling algorithms where migration is permitted are referred to as global. They use a single ready queue and do not require that all jobs of a task execute on the same core. A further differentiation of global scheduling is based on whether migration is only possible at job boundaries [Carpenter et al., 2004, Davis and Burns, 2010].

In case of task-level migration, different jobs of a task can be executed on different cores, but each job is executed on a single core. Job-level migration permits the preemptive execution of a single job on different processors (but no parallel execution of a job) [Davis and Burns, 2010].
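The three categories can be told apart mechanically from an execution trace. The following sketch assumes a hypothetical trace format, a list of (task, job, core) tuples with one entry per executed time slice, and reports the degree of migration the trace exhibits. The function name and the format are illustrative and not taken from the cited works.

```python
from collections import defaultdict


def classify_migration(trace):
    """Classify a schedule trace by the degree of migration it exhibits.

    `trace` is a list of (task, job, core) tuples, one per executed time
    slice (hypothetical format chosen for this sketch).
    """
    cores_per_job = defaultdict(set)
    cores_per_task = defaultdict(set)
    for task, job, core in trace:
        cores_per_job[(task, job)].add(core)
        cores_per_task[task].add(core)

    # A single job that ran on more than one core implies job-level migration.
    if any(len(cores) > 1 for cores in cores_per_job.values()):
        return "job-level migration"
    # Jobs stay put, but different jobs of one task used different cores.
    if any(len(cores) > 1 for cores in cores_per_task.values()):
        return "task-level migration"
    # Every task executed on exactly one core.
    return "partitioned"


if __name__ == "__main__":
    print(classify_migration([("A", 1, 0), ("A", 2, 0), ("B", 1, 1)]))
    print(classify_migration([("A", 1, 0), ("A", 2, 1), ("B", 1, 1)]))
    print(classify_migration([("A", 1, 0), ("A", 1, 1), ("B", 1, 1)]))
```

The three calls in the demo correspond, in order, to a partitioned schedule, a schedule with task-level migration, and a schedule with job-level migration.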

Global scheduling has the following advantages [Davis and Burns, 2010]:

• in many cases fewer preemptions (only required if no core idles) [Andersson and Johnsson, 2000],

• spare bandwidth (when a task does not need its WCET) can potentially be used by all other tasks,

• more appropriate for open systems that permit adding tasks at runtime.

Partitioned scheduling has the following advantages [Davis and Burns, 2010]:

• reduction of the multiprocessor scheduling problem to a set of less complex uniprocessor scheduling problems [Carpenter et al., 2004],

• no migration overhead (e.g., communication load for transfer of the context of job/task, additional cache misses),

• overrun of the WCET by a task affects only the tasks on the same core,

• no excessive overhead of maintenance of a single global ready queue for large systems (scalability).

Partitioned scheduling can reuse well-known uniprocessor scheduling results, e.g., schedule the tasks that are allocated to the same core by RM or EDF, whereas using these optimal uniprocessor scheduling algorithms in a global multiprocessor manner may result in arbitrarily low schedulable utilization (the so-called “Dhall effect”) [Dhall and Liu, 1978]. The main disadvantage of partitioned scheduling is the complexity of the allocation problem. Finding an optimal allocation of tasks to cores is analogous to bin packing and therefore known to be NP-hard [Garey and Johnson, 1979]. As a consequence, non-optimal heuristic partitioning algorithms are usually applied [Carpenter et al., 2004]. A second disadvantage is that there are task sets that are schedulable if and only if migration is permitted [Carpenter et al., 2004].
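Both disadvantages can be illustrated with a small partitioning heuristic. The sketch below uses first-fit decreasing, one of several common bin-packing heuristics and not necessarily the one used in the cited works. It assumes implicit-deadline periodic tasks scheduled by EDF on each core, so that a core is feasible as long as its total utilization does not exceed 1. The function name is illustrative.

```python
from fractions import Fraction


def first_fit_decreasing(utilizations, num_cores):
    """Assign tasks to cores; return per-core lists of task indices or None.

    Assumption: implicit-deadline periodic tasks under per-core EDF, so a
    core is feasible while its summed utilization stays at or below 1.
    """
    load = [Fraction(0)] * num_cores
    assignment = [[] for _ in range(num_cores)]
    # Consider tasks in order of decreasing utilization.
    order = sorted(range(len(utilizations)),
                   key=lambda i: utilizations[i], reverse=True)
    for i in order:
        for core in range(num_cores):
            if load[core] + utilizations[i] <= 1:
                load[core] += utilizations[i]
                assignment[core].append(i)
                break
        else:
            return None  # the heuristic found no core for task i
    return assignment


if __name__ == "__main__":
    easy = [Fraction(1, 2), Fraction(1, 2), Fraction(1, 3),
            Fraction(1, 3), Fraction(1, 4)]
    print(first_fit_decreasing(easy, 2))   # [[0, 1], [2, 3, 4]]

    # Three tasks with execution time 2 and period 3 on two cores.
    hard = [Fraction(2, 3)] * 3
    print(first_fit_decreasing(hard, 2))   # None
```

The second task set has a total utilization of exactly 2 on two cores, yet any two of its tasks together exceed a utilization of 1, so no partitioning exists at all. It can only be scheduled if migration is permitted. In general, a None result from the heuristic does not prove that no valid partitioning exists, since first-fit decreasing is not optimal.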

See Davis and Burns for an extensive survey of both partitioned and global multiprocessor scheduling algorithms [Davis and Burns, 2010].

2.3.2 Multicore and Predictability

Real-time systems require information about the worst-case execution times (WCETs) in order to guarantee deterministic behavior. The certification of functional safety depends on the ability to determine the system’s exact timing behavior and requires showing that the system reaches a safe state within a specified time interval after a hazard. This is achieved by analyzing the ECU activities and their timing in an end-to-end event chain [Stappert et al., 2010].

The timing of software is highly dependent on the underlying hardware. A precise determination of the WCETs on single-core architectures is already challenging, since processor features such as caches, pipelines, branch prediction, or co-processors evolved in order to maximize the average performance and complicate the timing analysis. Nevertheless, multiple established methods and tools exist [Wilhelm et al., 2008].

Multicore processors are significantly more difficult to analyze, as the sharing of on-chip resources between cores introduces complex timing effects at the machine instruction level or even nondeterminism. Independently executed software permanently competes for access to shared architectural elements such as:

• system bus,

• memory bus,

• memory controller,

• non-core-private caches,

• DMA controller,

• interrupt controller,

• I/O controller. [Kotaba et al., 2013]

Two issues arise for the determination of WCETs due to shared resource contention: nondeterminism and pessimism. Resource accesses might be arbitrated by certain units in a non-explicit manner, introducing nondeterministic delays and making it impossible to determine the WCET [Kotaba et al., 2013]. Alternatively, the complexity of the on-chip dependencies results in very long possible delays for certain operations and consequently in extremely pessimistic WCETs, which are potentially no longer economically acceptable.

An example is system bus contention: a static worst-case analysis might have to expect that each access is delayed by simultaneous accesses of all other cores plus asynchronous accesses by DMA controllers, as they autonomously access the shared bus, plus potentially asynchronous accesses of additional hardware units.
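The pessimism of such an analysis can be made tangible with a back-of-the-envelope bound. The sketch below assumes round-robin bus arbitration in which every other bus master is served at most once before each access of the analyzed task, and a fixed latency per bus transaction. These assumptions, the function name, and the numbers in the demo are illustrative and not taken from the cited work.

```python
def contention_bound(isolated_cycles, bus_accesses, num_cores,
                     num_dma, access_cycles):
    """Upper bound on execution time under worst-case bus interference.

    Assumptions (not from the text): round-robin arbitration in which each
    competing bus master delays every access of the analyzed task at most
    once, and a fixed `access_cycles` latency per bus transaction.
    """
    competitors = (num_cores - 1) + num_dma   # other cores plus DMA engines
    worst_delay_per_access = competitors * access_cycles
    return isolated_cycles + bus_accesses * worst_delay_per_access


if __name__ == "__main__":
    # Hypothetical task: 100,000 cycles in isolation, 5,000 bus accesses,
    # running on a quad-core with one DMA controller, 10 cycles per access.
    print(contention_bound(100_000, 5_000, num_cores=4,
                           num_dma=1, access_cycles=10))   # 300000
```

In this hypothetical setting the bound is three times the execution time in isolation, although the average-case slowdown would typically be far smaller. With a single active core and no DMA traffic the bound collapses to the isolated execution time.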

The severity of these issues is underlined by the common solution of the avionics domain to contention on the system bus: all but one core are disabled and any asynchronous DMA or I/O traffic is avoided [Kotaba et al., 2013].

Shared caches are particularly challenging due to the overhead of maintaining coherency and due to unpredictable inter-core interactions (a cache miss on one core can heavily impact the performance of the other cores in both directions, increasing or decreasing it) [Paun et al., 2013]. Modeling the behavior of shared caches is practically impossible
