Computer Architecture and Simulation - Interaction of Hardware Transactional Memory and Micropr

2.6 Summary

2.6.2 Computer Architecture and Simulation

The field of computer architecture has advanced tremendously in the past four decades. Modern CPUs provide several orders of magnitude more performance than their early predecessors, owing to advancements across the entire stack, from materials to high-level architecture.

One key concept is parallelism, which has been used to improve application performance through:

• pipelining – shortening the work per stage, enabling higher clock frequencies,

• instruction-level parallelism – executing multiple independent instructions at the same time,

• memory-level parallelism – performing multiple memory operations at the same time to overlap their stall times,

• data parallelism – wide vector units performing identical operations on multiple data items in parallel,

• thread-level parallelism – executing several threads or programs at the same time on multiple hardware threads or cores.
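To make instruction-level parallelism concrete, the following is a minimal sketch in C that sums an array in two ways. The function names `sum_serial` and `sum_ilp4` are chosen for illustration only. The first version forms one serial dependency chain; the second keeps four independent accumulators so that a superscalar, out-of-order core can overlap the additions. Whether the second version is actually faster depends on the compiler and the core, since an optimising compiler may already transform the first loop.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One accumulator: every addition depends on the previous one,
 * so the loop forms a single serial dependency chain. */
uint64_t sum_serial(const uint32_t *data, size_t n)
{
    assert(data != NULL || n == 0);
    uint64_t acc = 0;
    for (size_t i = 0; i < n; i++)
        acc += data[i];
    return acc;
}

/* Four accumulators: the four additions in each iteration do not
 * depend on one another, which gives the core independent work it
 * can issue in the same cycle. */
uint64_t sum_ilp4(const uint32_t *data, size_t n)
{
    assert(data != NULL || n == 0);
    uint64_t a0 = 0, a1 = 0, a2 = 0, a3 = 0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        a0 += data[i];
        a1 += data[i + 1];
        a2 += data[i + 2];
        a3 += data[i + 3];
    }
    for (; i < n; i++)          /* leftover elements */
        a0 += data[i];
    return (a0 + a1) + (a2 + a3);
}
```

Both functions return the same value for any input, because unsigned integer addition is associative; only the shape of the dependency graph differs.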

Furthermore, speculation is used to reduce the impact of critical paths: the hardware speculates that they will behave in a specific way, thereby unlocking parallelism, and rectifies the speculation in case it was incorrect. Finally, by exploiting locality, the speed of the compute units could increase mostly independently of the much slower improvements of the memory system: caches provide low-latency, high-bandwidth access to frequently used data.
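The effect of locality can be illustrated with a small C sketch that sums the same matrix in two traversal orders. The function names and the flat row-major layout are assumptions made for this example. Both functions compute the same result; they differ only in the order of memory accesses, and therefore in how well each fetched cache line is used.

```c
#include <assert.h>
#include <stddef.h>

/* Row-major walk: consecutive accesses touch adjacent addresses,
 * so every element of a fetched cache line is used before the
 * walk moves on to the next line. */
long sum_row_major(const int *m, size_t rows, size_t cols)
{
    assert(m != NULL);
    long total = 0;
    for (size_t r = 0; r < rows; r++)
        for (size_t c = 0; c < cols; c++)
            total += m[r * cols + c];
    return total;
}

/* Column-major walk over the same row-major data: consecutive
 * accesses are `cols` elements apart, so for large matrices each
 * access tends to land in a different cache line. */
long sum_col_major(const int *m, size_t rows, size_t cols)
{
    assert(m != NULL);
    long total = 0;
    for (size_t c = 0; c < cols; c++)
        for (size_t r = 0; r < rows; r++)
            total += m[r * cols + c];
    return total;
}
```

For matrices that fit into the first-level cache the two orders behave similarly; the difference typically becomes visible only once the working set exceeds the cache capacity.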

The rising complexity of CPU architecture drives up manufacturing costs and, at the same time, demands more careful tuning of the separate components to provide a balanced design. Simulation is widely used in academia and industry to predict the performance of specific features in both typical usage and dedicated corner conditions (running stress tests, performing big scalability studies). In addition, simulation (and emulation) allows software and hardware co-design and enables software adaptation before silicon is available – enabling shorter time to market. In line with the complexity of the CPUs, simulators have become more complex, too. Depending on the level of detail modelled, they cause significant slow-downs (10x - 100,000x) for the applications that are being analysed. Several techniques exist to accelerate simulation by extracting core regions of interest, sampling, or abstracting the actual simulation step.

In this thesis, I use detailed cycle-level simulation of a modern out-of-order CPU architecture with a realistic memory hierarchy to prototype a realistic HTM mechanism: in essence, a mechanism that speculates through critical sections and executes them in parallel so that application critical paths are shortened. One key challenge is the coordination between the existing parallelism and speculation mechanisms and those that are provided or required by HTM.

Instruction-Set Architecture and High-Level Design of HTM

3.1 Introduction

Making hardware transactional memory available in a microprocessor involves integrating the required new instructions into the instruction set architecture (ISA) of the system where it is to be used. Before that, however, the architecture-level properties of HTM need to be specified and the required new instructions need to be derived. In this chapter, I will describe the key integration points, architectural mechanisms, and desirable high-level properties of HTM, using the example of my work on AMD’s Advanced Synchronization Facility (ASF).

ASF is an experimental extension to the AMD64 instruction set [257], which in turn is the 64-bit extension to the widely used x86 ISA. Going from the description of generic HTM primitives and architectural functionality to detailed real-world challenges when integrating with a naturally grown ISA, this chapter will summarise the elaborate architectural design process and the resulting choices and implications of the chosen design points for ASF.

ASF is the concrete example, but many trade-offs carry over to integrating best-effort HTM (BeHTM) in other ways and into other baseline ISAs in general. In fact, with the appearance of several other industry-grade BeHTM proposals ([281, 284, 340]), the different design choices can be examined “in the wild”.

In this chapter, I will give as much background as I can on why specific choices for ASF were made the way they were, and explain the reasoning about associated costs and concerns in an industrial setting. The ASF design effort culminated in an official AMD experimental ISA specification document for ASF with all instruction encodings and interactions. Since the specification has all the detail, I will only briefly introduce the instructions and refer the reader to the “Advanced Synchronization Facility – Proposed Architectural Specification”, which I have attached in Appendix A.

In addition to providing context, summary, background, and reasoning, I will focus on changes (bug fixes, small tweaks and adaptations) and clarifications made after the publication of the specification. I will present major design reconsiderations and options in Chapter 6. The architectural design aspect of ASF has also been described in the following publications:

• “Evaluation of AMD’s Advanced Synchronization Facility within a Complete Transactional Memory Stack” at EuroSys 2010 [213]

• “The Velox Transactional Memory Stack” at IEEE Micro Journal [210]

• “ASF: AMD64 Extension for Lock-free Data Structures and Transactional Memory” - in MICRO 2010 [214]

• “Implementing AMD’s Advanced Synchronization Facility in an Out-of-Order x86 Core” - in TRANSACT 2010 [220]

• “Compilation of Thoughts about AMD Advanced Synchronization Facility and First-Generation Hardware Transactional Memory Support” - in TRANSACT 2010 [215]

• “From Lightweight Hardware Transactional Memory to Lightweight Lock Elision” - in TRANSACT 2011 [254]

The remainder of this chapter is organised as follows: the rest of this section will summarise the concepts required to integrate BeHTM (Sub-section 3.1.1), Section 3.2 will present the actual ISA extensions, and Section 3.3 will show a simple prototype for integrating the ISA extensions into C / C++. In Section 3.4, I will highlight incremental changes made to ASF to make it easier to use, and in Section 3.5, I will discuss ASF’s capacity and progress guarantees. Finally, Section 3.6 will summarise this chapter and ISA design for BeHTMs.

In document Interaction of Hardware Transactional Memory and Microprocessor Microarchitecture (Page 66-70)