The M5 Simulator - Parallel-Architecture Simulator Development Using Hardware Transactional Mem

To test our implementation and evaluate the performance of the HTM module, we will use the M5 simulator (12), a computer system architecture research tool widely used by the community. A complete list of publications that use this simulator can be found in (4).

3.2.1 Key Features and Configuration Choices

One of the simulator's most important features, and the one that makes it easy to modify, is its pervasive object orientation. All the major simulation structures (CPUs, buses, caches, etc.) are represented as objects. In addition, M5's internal object orientation (using C++) provides the usual software engineering advantages. A fairly simple configuration language allows flexible composition of these objects, so that complex simulation targets can be described. This is important for us because we will have to modify the simulator to use our own cache, directory and HTM objects, among others.
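The idea of composing a simulation target from independent objects can be sketched as follows. This is a toy illustration in plain Python: the class names, fields and default sizes are hypothetical and are not M5's actual configuration API.

```python
# Toy illustration only: these classes are hypothetical, NOT M5's API.
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Cache:
    size_kb: int


@dataclass
class Cpu:
    cpu_id: int
    l1: Cache  # each CPU owns a private L1 cache object


@dataclass
class Directory:
    num_entries: int


@dataclass
class System:
    cpus: List[Cpu] = field(default_factory=list)
    directory: Optional[Directory] = None


def build_system(num_cpus: int, l1_kb: int = 32) -> System:
    """Compose a target from independent objects, one private L1 per CPU."""
    cpus = [Cpu(cpu_id=i, l1=Cache(size_kb=l1_kb)) for i in range(num_cpus)]
    return System(cpus=cpus, directory=Directory(num_entries=1024))


system = build_system(num_cpus=4)
```

Because every component is an object, swapping in a different cache or directory only changes the line that constructs it, which is the property we rely on when plugging in our own modules.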

M5 supports multiple interchangeable CPU models. Currently there are three: a simple in-order CPU; a detailed out-of-order CPU that is superscalar and has simultaneous multithreading (SMT) capabilities; and a random memory-system tester. The first two models share a common high-level ISA description, which we will modify to provide new instructions that support transactional execution. We used the AtomicSimpleCPU, an in-order, one-cycle-per-instruction CPU. This choice was not made to simplify the system but for consistency with the HTM literature: only a few HTM proposals use out-of-order CPUs, and they have shown it to be quite challenging (36).

M5 features a detailed event-driven memory system, including non-blocking caches over a simple snooping coherence protocol. Since we need a directory-based coherence protocol, as shown in the previous section, we have to develop our own implementation of the memory hierarchy (caches, main memory banks and directories). Thanks to M5's object orientation, instantiating multiple CPU objects within a system is trivial. Combined with our module that defines the memory hierarchy, this lets us easily simulate the desired system.
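To make the contrast with snooping concrete, the following minimal sketch shows the bookkeeping a directory performs: instead of broadcasting on a bus, it records which CPUs hold each line and returns only the CPUs that must be notified. The class and method names are hypothetical, and the protocol is a simplified illustration, not the one implemented in this work.

```python
# Simplified directory bookkeeping; names and protocol are illustrative only.
from dataclasses import dataclass, field
from typing import Dict, Optional, Set


@dataclass
class DirectoryEntry:
    sharers: Set[int] = field(default_factory=set)  # CPUs with a read-only copy
    owner: Optional[int] = None                     # CPU with a modified copy


class ToyDirectory:
    def __init__(self) -> None:
        self.entries: Dict[int, DirectoryEntry] = {}

    def read(self, cpu: int, addr: int) -> Set[int]:
        """Record a read; return the CPUs that must downgrade their copy."""
        entry = self.entries.setdefault(addr, DirectoryEntry())
        downgraded: Set[int] = set()
        if entry.owner is not None and entry.owner != cpu:
            # The previous owner keeps a shared copy after writing it back.
            downgraded.add(entry.owner)
            entry.sharers.add(entry.owner)
            entry.owner = None
        if entry.owner != cpu:
            entry.sharers.add(cpu)
        return downgraded

    def write(self, cpu: int, addr: int) -> Set[int]:
        """Record a write; return the CPUs whose copies must be invalidated."""
        entry = self.entries.setdefault(addr, DirectoryEntry())
        invalidated = set(entry.sharers)
        if entry.owner is not None:
            invalidated.add(entry.owner)
        invalidated.discard(cpu)  # the writer keeps its own copy
        entry.sharers = set()
        entry.owner = cpu
        return invalidated
```

Only the CPUs returned by `read` and `write` receive a message, which is what lets a directory scale beyond a shared bus.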

The simulator supports both full-system and system-call emulation execution modes. We are interested in full-system mode because it gives us a functional environment that can, for example, interact with a disk image, which is where we will store our test binaries. Full-system mode is only available for the Alpha and SPARC architectures. Alpha can boot an unmodified Linux 2.4/2.6 kernel as well as FreeBSD, while SPARC can boot Solaris with some constraints. We chose Alpha as our testing platform because, using Linux, we can be sure that all the tests we will use to evaluate performance and scalability will work properly. Note that no Alpha hardware is needed to make full use of M5 compiled for the Alpha architecture, because Alpha binaries that run on M5 can be built on x86 systems using gcc-based cross-compilation tools.

Furthermore, M5 is released under an open-source license, and it has an active community around it with good support from its main developers.

Related Work

There have been a number of proposals for Transactional Memory (TM) in recent years. In this chapter we walk through some of them to give an overview of the research done by the TM community.

TM proposals that use pessimistic conflict detection, such as Log-TM (29) and Unbounded TM (UTM) (11), write to memory directly (eager data versioning). This improves the performance of commits, which are more frequent than aborts. However, it may also incur additional violations that do not occur with lazy data versioning. Moreover, UTM tries to address the problem of limited hardware buffering by providing mechanisms that support transactions of arbitrary size and duration in a pure hardware approach. UTM is not the only proposal in this area: Virtualizing TM (30) provides different mechanisms that shield the programmer from various platform-specific resource limitations.
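The difference between the two versioning policies can be sketched with a toy memory modelled as a dictionary. This is a minimal illustration under our own assumptions (the class names are hypothetical and conflict detection is omitted); it is not the mechanism of any of the cited systems.

```python
# Toy model of data versioning; memory is a plain dict, no conflict detection.
_MISSING = object()  # marks an address that did not exist before the write


class EagerTx:
    """Eager versioning: write in place, keep old values in an undo log."""

    def __init__(self, memory: dict) -> None:
        self.memory = memory
        self.undo_log = []

    def write(self, addr, value) -> None:
        self.undo_log.append((addr, self.memory.get(addr, _MISSING)))
        self.memory[addr] = value

    def commit(self) -> None:
        self.undo_log.clear()  # cheap: memory already holds the new values

    def abort(self) -> None:
        # Expensive path: restore old values in reverse order.
        for addr, old in reversed(self.undo_log):
            if old is _MISSING:
                self.memory.pop(addr, None)
            else:
                self.memory[addr] = old
        self.undo_log.clear()


class LazyTx:
    """Lazy versioning: buffer writes, publish them only at commit."""

    def __init__(self, memory: dict) -> None:
        self.memory = memory
        self.write_buffer = {}

    def write(self, addr, value) -> None:
        self.write_buffer[addr] = value

    def read(self, addr):
        # A transaction must see its own speculative writes.
        return self.write_buffer.get(addr, self.memory.get(addr))

    def commit(self) -> None:
        self.memory.update(self.write_buffer)  # expensive path: copy to memory
        self.write_buffer.clear()

    def abort(self) -> None:
        self.write_buffer.clear()  # cheap: memory was never touched
```

The sketch shows where each policy pays its cost: eager versioning makes commit trivial and abort costly, while lazy versioning does the opposite.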

Scalable-TCC is based on earlier work called TCC (23), which was the first hardware TM system with lazy data versioning and optimistic conflict detection. However, TCC suffers from two major bottlenecks. First, it uses an inherently non-scalable communication medium between processors (a common bus); second, all commits are serialized by a commit token that a transaction has to acquire at commit time. Scalable-TCC (19) addresses both problems.
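The serialization caused by a single commit token can be illustrated with a minimal sketch. The class below is hypothetical and models only the token itself, not TCC's bus or its conflict handling.

```python
# Hypothetical model of a single system-wide commit token.
from typing import Optional


class CommitToken:
    def __init__(self) -> None:
        self.holder: Optional[int] = None  # id of the committing transaction

    def try_acquire(self, tx_id: int) -> bool:
        """Grant the token only if no other transaction is committing."""
        if self.holder is None:
            self.holder = tx_id
            return True
        return False

    def release(self, tx_id: int) -> None:
        if self.holder != tx_id:
            raise RuntimeError("only the holder may release the token")
        self.holder = None


token = CommitToken()
first = token.try_acquire(0)   # transaction 0 starts committing
second = token.try_acquire(1)  # transaction 1 must wait
token.release(0)
third = token.try_acquire(1)   # now transaction 1 can commit
```

While one transaction holds the token, every other transaction that is ready to commit has to wait, even if the transactions touch disjoint data.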

Newer proposals try to take the best of both worlds, lazy-like systems and eager-like systems. Eager-lazy HTM (EazyHTM)

scalable and easy to implement. Detecting conflicts while the transaction is running makes the commit process much faster, and delaying resolution until commit time does not incur additional violations.

Architectural Details

In this chapter we discuss the architecture used and the decisions we made about the architectural setup of the whole system. Furthermore, we explain in detail the internals of each main component of the system.
