
In Section 2.5.1, I describe related work on modern CPU architecture, and in Section 2.5.2 I present existing work on how current and future CPU designs can be evaluated through simulation.

2.5.1 CPU Architecture and Micro-Architecture

Microprocessors became commercially available with the Intel 4004 in 1971, which had 2300 transistors, was manufactured in a 10 µm process, and ran at 740 kHz. Today (2017), a typical CPU such as the AMD Ryzen 1800X comprises 4.8 billion transistors, is manufactured in a 14 nm process, runs at up to 4.0 GHz, and can execute 16 threads in parallel.

Generic CPU design In the 46 years since, many advances in CPU design and manufacturing have been made, far more than there is space for in this section. Standard textbooks present the key development steps and are continuously updated [123]. Key techniques of CPU designs that are optimised for performance are:

Pipelining splits up the actions of instructions into multiple stages so that each stage becomes shorter, allowing higher clock speed, and multiple instructions can be executed at different stages [14, 20, 60]

Super-scalar execution can execute more than one instruction at the same time, making use of instruction-level parallelism (ILP), and is often combined with pipelining [23]

Out-of-order execution furthermore allows instructions to be executed independently of program order, extracting more ILP and memory-level parallelism (MLP) [1, 10, 38]

Caches store frequently used data and instructions closer to the execution units, reducing access latency and increasing bandwidth [12]

Vector architectures exploit data-level parallelism and perform the same (arithmetic) operations on all elements of a vector [2, 13, 43, 161, 376]

Multi-core CPUs extract higher-level (task / process) parallelism and allow multiple application processes or threads to execute at the same time on their own cores, but inside the same system (disk, memory, I/O) [42, 75]

Speculation predicts application and CPU behaviour, which shortens the critical path and unlocks more parallelism [37, 50, 77]

Branch prediction is a special type of speculation that predicts whether conditional branches are taken or not, and where indirect branches will point to. That reduces fetch / decode latency, and unlocks a deeper instruction window for out-of-order execution [26, 56, 63]
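To illustrate the branch prediction technique listed above, the following is a minimal sketch of a two-bit saturating-counter predictor, a classic textbook scheme. It is not the predictor of any specific CPU mentioned in this section; the class name, the table size, and the example branch address are assumptions made purely for illustration.

```python
class TwoBitPredictor:
    """Table of 2-bit saturating counters indexed by branch address (illustrative)."""

    def __init__(self, entries=1024):
        self.entries = entries  # assumed table size
        # Counter values 0 and 1 predict not-taken; 2 and 3 predict taken.
        self.table = [1] * entries  # start in the weakly not-taken state

    def _index(self, pc):
        return pc % self.entries

    def predict(self, pc):
        return self.table[self._index(pc)] >= 2

    def update(self, pc, taken):
        i = self._index(pc)
        if taken:
            self.table[i] = min(3, self.table[i] + 1)
        else:
            self.table[i] = max(0, self.table[i] - 1)


def run_loop_branch(predictor, pc, iterations):
    """Feed a loop-closing branch: taken (iterations - 1) times, then not taken."""
    mispredictions = 0
    for i in range(iterations):
        taken = i < iterations - 1
        if predictor.predict(pc) != taken:
            mispredictions += 1
        predictor.update(pc, taken)
    return mispredictions
```

For a ten-iteration loop, a cold predictor mispredicts the first and the last branch; once warmed up, only the loop exit is mispredicted, because a single not-taken outcome does not flip a strongly-taken counter.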

In addition to extending the structure of CPUs, the instruction set architecture (ISA) of CPUs is constantly evolving to give applications access to performance-enhancing features, and also to add other features, such as security. Currently, two ISAs dominate the market: Intel and AMD’s x86 ISA and its extensions [257, 367], and the ARM ISA [351].

Synchronisation The most relevant aspect of multi-core CPUs and ISAs here is synchronisation. Parallelism and the ability to find and exploit it are valuable because they make programs run faster. In some cases, however, parallel execution must be controlled very carefully, particularly when coordinating access to a single shared resource, such as the screen or disk, or when waiting for several parallel computations to end so that their results can be combined.

All modern ISAs provide at least one primitive for synchronisation with infinite consensus number [28]. x86 offers compare-and-swap (cmpxchg), while IBM Power and Arm provide the equally strong load-linked / store-conditional instruction pairs ldarx / stdcx and ldrex / strex, respectively [18, 257, 351, 353, 367]. Further extensions on x86 are double-wide CAS (cmpxchg8b, cmpxchg16b), which pairs a pointer with a timestamp to avoid the ABA problem [41], and many other atomic load-op-store instructions. The Arm ISA recently also gained support for atomic load-op-stores in the Armv8.1-A version of the ISA [352].
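The way compare-and-swap supports an atomic load-op-store can be sketched in software. The model below is only an illustration: the AtomicCell class, its method names, and the helper functions are hypothetical, and a Python lock stands in for the atomicity that the hardware instruction provides.

```python
import threading


class AtomicCell:
    """Software model of a memory word with compare-and-swap (illustrative)."""

    def __init__(self, value=0):
        self._value = value
        self._lock = threading.Lock()  # assumption: stands in for hardware atomicity

    def load(self):
        with self._lock:
            return self._value

    def compare_and_swap(self, expected, new):
        """Store `new` only if the cell still holds `expected`; report success."""
        with self._lock:
            if self._value == expected:
                self._value = new
                return True
            return False


def fetch_and_add(cell, delta):
    """Atomic load-op-store built from a CAS retry loop; returns the old value."""
    while True:
        old = cell.load()
        if cell.compare_and_swap(old, old + delta):
            return old
        # Another thread changed the cell in between: reload and retry.


def parallel_increment(threads=4, per_thread=1000):
    """Increment one shared cell from several threads and return the final value."""
    cell = AtomicCell(0)

    def worker():
        for _ in range(per_thread):
            fetch_and_add(cell, 1)

    pool = [threading.Thread(target=worker) for _ in range(threads)]
    for t in pool:
        t.start()
    for t in pool:
        t.join()
    return cell.load()
```

Because a failed compare-and-swap leaves the cell untouched and the loop retries with a fresh value, no increment is lost, whichever way the threads interleave.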

Even though CAS and LL/SC have infinite consensus number and can emulate all other synchronisation primitives, several extensions to them have been proposed: double compare-and-swap instructions (DCAS) operate on two independent data items. The Motorola 68k processor provided such an instruction and it has been used in OS kernels [30]. Knight proposes a TM-lite scheme with a load / compute prefix and a single store to shared memory [16].

A good recent summary article by David et al. evaluates different synchronisation primitives and how they perform [288]. They compare a variety of synchronisation primitives on a variety of cache-coherent systems. Generally, synchronisation performance in focussed testing depends very much on system topology decisions and choices made in the cache hierarchy. For example, communication within the same socket is usually faster. In some systems, however, home nodes are used to provide a central ordering point, and these can cause unnecessary round-trips for messages even though the data lives on the same socket. Furthermore, simple locks seem to work well in single-socket systems, while more complex locks (hierarchical, MCS, ticket) perform better in systems with more complex topologies.

2.5.2 Simulation

Manufacturing CPUs is expensive; therefore, new features such as Transactional Memory are first implemented in simulators that model key aspects of future CPUs and systems, so that performance and operational studies can be performed ahead of manufacturing time.

Mirroring the complexity of the CPU microarchitectures being modelled, a wide variety of simulators has been proposed, both in academia and as commercial offerings. Furthermore, many CPU manufacturers use their own in-house simulators, which are sometimes fully home-grown and sometimes extend other public simulators.

Broadly speaking, simulators fall into the following types of abstraction:

Behavioural or ISA-level simulators execute instructions and ensure that all instructions behave as if they were executed on a real CPU. They are useful for testing new ISA extensions and are usually used interactively. They only provide a crude approximation of temporal (IPC) and micro-architectural behaviour (such as branch predictors, caches, pipelines). Their main goal is typically to provide comprehensive instruction emulation and high speed. Well-known examples are Simics [61] and QEMU [81]. Slowdowns are typically on the order of 10x compared to native execution.

Cycle-level simulators allow performance analysis for new CPU features. As such, they model key micro-architecture features, such as pipeline structures, branch predictors, caches and memory. Sometimes, these simulators also provide a behavioural model in execution-driven simulation; in other cases, they are trace-driven or merged directly into the executed binary, for example through binary translation [162, 295]. Due to the added modelling complexity, cycle-level simulators are significantly slower than ISA-level simulators, with typical slowdowns of 1000x to 100000x. Simulation times can be reduced by limiting focus to the most interesting time periods [69, 361] and the relevant subsystems [93].

Abstract or analytical simulators do not execute instruction by instruction, but instead abstract away low-level details. They range from models that inspect the instruction dependency graph to estimate the impact of memory stalls [356], through simple stall models [242] and parallel phase synchronisation models [344, 382], all the way to fully analytical regression models [57, 76, 78, 83, 96, 97, 99, 100, 264]. All of these can significantly reduce simulation time and allow faster design-space exploration, sometimes even with mathematical help such as differentiable abstract models and gradient descent.

For my work, behavioural and cycle-level simulators are the most appropriate, as I want to study ISA-level behaviour, micro-architectural behaviour, and performance in more detail.
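To put the slowdown figures quoted above in perspective, the following small sketch converts a slowdown factor into host wall-clock time. The function names and the choice of one second of native execution as the workload are assumptions for illustration; the slowdown factors are the ones stated in the list above.

```python
def simulated_wall_clock_seconds(native_seconds, slowdown):
    """Host time needed to simulate `native_seconds` of target execution."""
    return native_seconds * slowdown


def format_duration(seconds):
    """Render a duration in seconds as hours, minutes and seconds."""
    hours, rem = divmod(int(seconds), 3600)
    minutes, secs = divmod(rem, 60)
    return f"{hours}h {minutes}m {secs}s"


# One second of native execution under the slowdowns quoted in the text.
for label, slowdown in [("ISA-level", 10),
                        ("cycle-level (low)", 1000),
                        ("cycle-level (high)", 100000)]:
    host = simulated_wall_clock_seconds(1, slowdown)
    print(f"{label}: {format_duration(host)}")
```

At the upper end of the cycle-level range, a single second of native execution takes more than a day of host time, which is why limiting simulation to regions of interest matters.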

Behavioural Simulators Simics [61] is a flexible, extensible behavioural simulator that simulates multiple ISAs, including x86-64 and ARM, and executes full system stacks including the kernels of Linux and Windows. In addition to running applications and operating systems at the ISA level, for example for software porting, Simics has a large API through which timing simulation modules for the CPU and devices can be hooked into the behavioural engine. One interesting aspect is that Simics supports out-of-order execution to drive such timing simulators: it can execute instructions with infinite reordering windows and can undo speculation in case of exceptions. Simics is closed-source software under a proprietary license. Originally (with Virtutech), free academic licenses were available, but these no longer seem to exist. Simics is often used to drive the behaviour and system IP of academic simulators that focus on the memory subsystem or add a more detailed timing core model. Simics achieves about 5 to 10 MIPS on a 933 MHz Pentium-III host, roughly a 100x slowdown, without additional analysis plugins.

QEMU [81] is an open-source simulator that uses binary translation to provide both full-system and application-only execution of different ISAs on a variety of host systems. In addition to instruction emulation, QEMU also provides emulation of devices. In that capacity, QEMU is frequently used for virtualisation with KVM [125]. Because of the focus on speed, QEMU does not provide explicit hooks for additional timing simulators.

PIN [92] is a proprietary dynamic binary translation (DBT) tool that executes applications and allows “mixing” in instrumentation. Instead of emulating instructions, PIN rewrites the existing binary instruction stream, adding code that provides additional (analysis) functionality, such as calls to cache models or data flow tracking. PIN executes x86 on x86 only, and depending on the amount of additional instrumentation, the overheads can be very small. Other DBT tools are DynamoRIO [59], Valgrind [128], fastBT [230], and HDTrans [131].

Several other hardware vendors have provided simulators / emulators similar to QEMU and PIN, with a focus on faithfully modelling the ISA and system models. AMD’s SimNow [260] is used for bringing up new operating system kernels and applications; for example, it was used to port Linux to x86-64 before hardware was available [79].

ARM provides FastModels [369], which is similarly used for platform, OS, and application bring-up.

Cycle-level Simulators Several families of cycle-level simulators exist. gem5 [241] combines the memory subsystem simulator (Ruby) of GEMS [93], which provides a detailed, configurable memory hierarchy suited to evaluating cache coherence protocols, with the M5 simulator [115], which provides the CPU models, easy composability of models, and event-based timing simulation. gem5 is a full-system simulator that most prominently models both the x86 and ARM ISAs (further ISAs are supported, but deprecated), and provides a modular system composition with different types of cores, system components, and various abstraction / speed levels. The core of gem5 is an event-driven simulator which fast-forwards simulation models to when events happen.
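The idea of an event-driven simulation core that jumps directly to the time of the next event can be sketched as follows. This is a generic discrete-event kernel written for illustration only and does not reflect gem5's actual API; the EventQueue class, the callback names, and the 100-cycle memory latency are all assumptions.

```python
import heapq
import itertools


class EventQueue:
    """Minimal discrete-event kernel: time jumps straight to the next event."""

    def __init__(self):
        self.now = 0
        self._heap = []
        self._order = itertools.count()  # tie-breaker preserves scheduling order

    def schedule(self, delay, callback):
        """Register `callback` to fire `delay` cycles after the current time."""
        heapq.heappush(self._heap, (self.now + delay, next(self._order), callback))

    def run(self):
        while self._heap:
            when, _, callback = heapq.heappop(self._heap)
            self.now = when  # idle cycles in between are skipped entirely
            callback(self)


def demo():
    """Model a cache miss at cycle 5 that is answered after a hypothetical 100 cycles."""
    log = []
    queue = EventQueue()

    def memory_reply(q):
        log.append((q.now, "data returned"))

    def cache_miss(q):
        log.append((q.now, "miss issued"))
        q.schedule(100, memory_reply)  # assumed memory latency

    queue.schedule(5, cache_miss)
    queue.run()
    return log
```

The kernel processes only two events here, although 105 cycles of simulated time pass; the cost of simulation scales with the number of events rather than with the number of simulated cycles.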

PTLsim [135] and its descendant Marss86 [253] focus on out-of-order cores and the x86 ISA. Both provide ways to fast-forward simulation to regions of interest and perform full-system simulation. While PTLsim uses the Xen hypervisor [72] as a hardware abstraction layer and supports switching between native execution and simulation, Marss86 relies on QEMU as the fast-forward architectural execution model and switches between QEMU and the detailed simulation core. Marss86 furthermore adds a much more detailed cache hierarchy and coherence protocol, replacing the fully-private cache hierarchy in PTLsim.

Graphite [229] uses a direct execution mechanism that runs applications through a behavioural simulator, collects statistics about the executed code, and feeds those asynchronously to a cycle-level timing model [46]. Compared with the original work, Graphite focusses on high core counts and user-space only simulation, and can distribute simulation across many physical host machines with lax timing synchronisation. Distribution, lax synchronisation, abstraction, and the use of analytical models make Graphite a good tool for inspecting large-scale scalability of applications, but less so for understanding detailed pipeline and cache interactions.

Sniper [242] extends the concepts of Graphite, but adds more detailed modelling of stall behaviour: it quickly executes non-stalling instruction windows and discovers ILP / MLP among independent instructions if the head instruction is stalled. Sniper is implemented in the Graphite framework, but extends it in the following areas: an overall improved memory hierarchy, MSI cache snooping behaviour, better branch prediction, and basic cost functions for thread sleep / wakeup.
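For readers unfamiliar with MSI, the state transitions of the protocol can be sketched for a single cache line shared by several cores. This is a textbook-style illustration under simplifying assumptions (one line, no timing, no data values) and does not represent the implementation in Sniper or in any other simulator discussed here; the class and method names are hypothetical.

```python
from enum import Enum


class State(Enum):
    MODIFIED = "M"
    SHARED = "S"
    INVALID = "I"


class MsiLine:
    """Per-core MSI state of one cache line (illustrative, no timing or data)."""

    def __init__(self, cores):
        self.states = [State.INVALID] * cores

    def read(self, core):
        if self.states[core] is State.INVALID:
            # A modified copy elsewhere is written back and downgraded to shared.
            for other, state in enumerate(self.states):
                if other != core and state is State.MODIFIED:
                    self.states[other] = State.SHARED
            self.states[core] = State.SHARED
        # Reads in the shared or modified state hit locally and change nothing.

    def write(self, core):
        # All other copies are invalidated before this core gains ownership.
        for other in range(len(self.states)):
            if other != core:
                self.states[other] = State.INVALID
        self.states[core] = State.MODIFIED

    def snapshot(self):
        """Return the per-core states as a compact string such as 'SSI'."""
        return "".join(state.value for state in self.states)
```

A short usage sequence on three cores: after core 0 and core 1 read, the snapshot is "SSI"; a write by core 2 yields "IIM"; a subsequent read by core 0 downgrades the modified copy and yields "SIS".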

More recently, ZSim [295] followed similar approaches by accelerating sequential simulation through the use of binary translation and direct execution, a two-stage parallelisation of the simulation, and finally a small abstraction layer to isolate the user-space only simulation from system-level specifics.

In recent related work, I have helped improve the field of simulation through: abstraction of the core pipeline by elastic memory traces [356], providing accurate power estimation capabilities [364, 375, 379], simulating distributed systems on top of distributed systems [368], and using a novel way of simulator cloning to deterministically model fused heterogeneous cores [378].

For my thesis, I have extended (and maintained) both PTLsim and Marss86. I have implemented a detailed hardware transactional memory mechanism with realistic pipeline and memory system integration, and appropriate instruction mnemonics and extensions at both the ISA level and the microarchitecture level. Furthermore, I have repaired and enhanced both simulators: PTLsim originally only modelled private per-core cache hierarchies, which I extended with a simple MSI coherence model; neither simulator modelled the logic for in-order execution of loads, which I repaired; finally, Marss86’s memory hierarchy is very detailed, but suffered from deadlocks (and other smaller bugs) that I repaired in this thesis.