Meta-Tracing vs Partial Evaluation

2.4 Meta-Tracing

2.4.2 Meta-Tracing vs Partial Evaluation

Truffle and RPython both achieve the goal of building JIT VMs from just an interpreter. This property is crucial for both thrusts of this thesis. In both approaches, the complicated JIT-generation framework is abstracted away; to make use of the framework, one only has to write a relatively simple interpreter. First, I demonstrate that meta-JIT frameworks are excellent tools for building agile hardware simulators at various levels, including the functional level, cycle level, and register-transfer level. Second, since meta-JIT frameworks allow building JIT VMs for multiple dynamic languages, these VMs make an excellent workload target for hardware specialization. While this thesis could have used the Truffle framework, I have chosen the RPython framework for a number of reasons: (1) writing interpreters for RPython is slightly easier as it requires fewer hints [MD15]; (2) PyPy, currently the fastest implementation of Python, is written in RPython, and Python has not had as many hardware acceleration proposals as JavaScript despite being a more popular language; (3) Python is arguably more general purpose than JavaScript; and (4) not all parts of the Truffle framework are open source (e.g., the Truffle JavaScript interpreter is not open source). Repeating the studies presented in this dissertation using the Truffle framework is a potential direction for future work.

PART I

ACCELERATING HARDWARE SIMULATIONS WITH META-TRACING JIT VMS

The first part of this thesis focuses on using meta-tracing JIT technology to enable fast, accurate, and agile hardware simulation. Broadly, hardware systems are modeled at three different levels: functional level (FL), cycle level (CL), and register-transfer level (RTL). FL models focus purely on the functional behavior of hardware and do not model time. FL models can be useful in initial design-space exploration and in techniques such as mixed-level modeling and sampling. CL models add timing simulation to the functional behavior. CL models do not model timing with perfect accuracy, but can still provide valuable insights as hardware designs are refined. RTL simulations accurately model register-level resources and time in terms of cycle count. While increasing modeling detail can provide more accurate performance estimates, more detailed models also take much longer to develop and to simulate. Part I of the thesis explores using meta-tracing for all three modeling levels. Chapter 3 introduces Pydgin, where I present an approach to develop an interpreter using the RPython meta-tracing JIT framework for a functional model of a processor. Chapter 4 extends Pydgin to introduce JIT-compiled cycle-level models and the embedding of the functional-level model inside another cycle-level simulator, allowing CL modeling through sampling and fast forwarding. Chapter 5 proposes novel meta-tracing JIT techniques to speed up PyMTL, a Python-based RTL modeling framework.

CHAPTER 3 PYDGIN: ACCELERATING FUNCTIONAL-LEVEL MODELING USING META-TRACING JIT VMS

This chapter explores the use of meta-tracing JIT VMs to accelerate functional-level (FL) hardware modeling. Pydgin provides a common framework for productively describing instruction set architectures (ISAs) using an RPython-based domain-specific language (DSL). Pydgin uses these instruction-set descriptions to create high-performance JIT-optimized instruction-set simulators by making use of RPython’s meta-tracing JIT compiler. We implemented high-performance Pydgin ISSs for MIPS, ARM, and RISC-V ISAs.

While Pydgin was created in collaboration with Lockhart [LIB15, Loc15], I led the JIT optimizations that make Pydgin high-performance, and co-led the RISC-V port. Importantly, this chapter highlights that meta-tracing JIT technology is promising for functional-level hardware design, and it is the starting point for my work on accelerating cycle-level hardware design in Chapter 4.

3.1 Introduction

Instruction-set simulators (ISSs) are used to functionally simulate instruction-set architecture (ISA) semantics. ISSs play an important role in aiding software development for experimental hardware targets that do not exist yet. ISSs can be used to help design brand new ISAs or ISA extensions for existing ISAs. ISSs can also be complemented with performance counters to aid in the initial design-space exploration of novel hardware/software interfaces.

Performance is one of the most important qualities for an ISS. High-performance ISSs allow real-world benchmarks (many minutes of simulated time) to be simulated in a reasonable amount of time (hours of simulation time). For the simplest ISS type, an interpretive ISS, typical simulation rates are between 1 and 10 million instructions per second (MIPS) on a contemporary server-class host CPU. For a trillion-instruction-long benchmark, a typical length found in SPEC CINT2006, this would take many days of simulation time. Instructions in an interpretive ISS need to be fetched, decoded, and executed in program order. To improve performance, more advanced ISSs use a much more sophisticated technique called dynamic binary translation (DBT). In DBT, the target instructions are dynamically translated to host instructions and cached for future use. Whenever possible, these already-translated and cached host instructions are used to amortize much of the target instruction fetch and decode overhead. DBT-based ISSs typically achieve performance levels in the range of 100s of MIPS, lowering the simulation time to a few hours. QEMU is a widely used DBT-based ISS because of its high performance: it can achieve 1000 MIPS on certain benchmarks, or less than an hour of simulation time for a trillion-instruction-long benchmark [Bel05].
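The simulation times quoted above follow from simple arithmetic, which the following minimal sketch reproduces. The function name and the four sample rates are illustrative choices picked from the ranges mentioned in the text; they are not measurements.

```python
# Back-of-the-envelope simulation times for the throughput figures quoted above.
# The function name and the sample rates are illustrative, not measured data.

def simulation_seconds(num_instructions, mips):
    """Wall-clock seconds to simulate num_instructions at `mips` million instr/s."""
    return num_instructions / (mips * 1e6)

TRILLION = 10**12  # instructions in the benchmark

for label, mips in [("interpretive,    1 MIPS", 1),
                    ("interpretive,   10 MIPS", 10),
                    ("DBT,           100 MIPS", 100),
                    ("DBT,          1000 MIPS", 1000)]:
    secs = simulation_seconds(TRILLION, mips)
    print("%s: %8.1f hours (%.2f days)" % (label, secs / 3600.0, secs / 86400.0))
```

At 1 MIPS the trillion-instruction benchmark takes roughly 11.6 days, at 100 MIPS just under 3 hours, and at 1000 MIPS about 17 minutes.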

Productivity is another important quality for ISSs. A productive ISS allows productive development of new ISAs; productive extension for ISA specialization; and productive custom instrumentation to quantify the performance benefits of new ISAs and extensions. These productivity features are especially important for emerging open-source ISAs that encourage domain-specific extensions. For example, RISC-V is a new ISA that embraces the idea of specifying a minimalist standard ISA and encouraging users to use its numerous mechanisms for extensions for their domain-specific needs [WA17]. The users of RISC-V would likely need their ISS to be productive for extension and instrumentation while still needing the high-performance features to allow them to run real-world benchmarks. Bridging this productivity-performance gap requires agile simulation frameworks.

There have been previous ISSs that aim to achieve both productivity and performance. Highly productive ISSs typically use high-level architecture description languages (ADLs) to represent the ISA semantics [RABA04, ŽPM96, Pen11, QRM04, DQ06]. However, to achieve high performance, ISSs use low-level languages such as C with a custom DBT, which requires in-depth, low-level knowledge of the DBT internals [May87, CK94, WR96, MZ04, TJ07, JT09]. To bridge this gap, previous research has focused on techniques to automatically translate high-level ADLs into a low-level language where the custom DBT can optimize the performance [PC11, WGFT13, QM03b]. However, these approaches tend to suffer because the ADLs are too close to low-level C, not very well supported, or not open source.

A similar productivity-performance tension exists in the programming languages community. Interpreters for highly productive dynamic languages (e.g., JavaScript and Python) need to be written in low-level C/C++ with very complicated custom just-in-time compilers (JITs). A notable exception to this is the PyPy project, the JIT-optimized interpreter for the Python language, which was written in a reduced typeable subset of Python, called RPython [BCF+11a, BCFR09, AACM07]. To translate the interpreter source written in RPython to a native binary, the PyPy community also developed the RPython translation toolchain. The RPython translation toolchain translates the interpreter source from the RPython language, adds a JIT, and generates C to be compiled with a standard C compiler. Chapter 2 provides an extensive introduction to RPython, meta-tracing JIT compilation, and related techniques. Pydgin is an ISS written in the RPython language, and uses the RPython translation toolchain to generate a fast JIT-optimized interpreter (JIT and DBT in the context of ISSs are very similar; we use both terms interchangeably in this chapter) from high-level architectural descriptions in Python [LIB15]. This unique development approach and the productivity features built into the Pydgin framework make Pydgin an ideal candidate to fill the ISS productivity-performance gap.
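To make the idea of an interpretive ISS described in plain Python more concrete, the following is a minimal sketch for a made-up five-instruction ISA. It is not the Pydgin ADL: the encoding, the State class, and all function names are hypothetical and chosen only for illustration. The per-PC decode cache is a simplified stand-in for the amortization that a DBT or JIT provides, and it assumes the program does not modify itself.

```python
# Toy interpretive ISS for a hypothetical ISA (NOT the Pydgin ADL).
# Encoding (16-bit word, made up): opcode[15:12] rd[11:8] rs[7:4] imm[3:0]

class State(object):
    def __init__(self, program):
        self.mem = program        # instruction memory: list of encoded words
        self.regs = [0] * 16      # register file
        self.pc = 0               # program counter (word index)
        self.running = True
        self.num_decodes = 0      # how many times decode() was needed

# Instruction semantics: one small function per instruction.
def execute_addi(s, rd, rs, imm):
    s.regs[rd] = s.regs[rs] + imm
    s.pc += 1

def execute_add(s, rd, rs, imm):
    s.regs[rd] = s.regs[rd] + s.regs[rs]
    s.pc += 1

def execute_subi(s, rd, rs, imm):
    s.regs[rd] = s.regs[rs] - imm
    s.pc += 1

def execute_bnez(s, rd, rs, imm):
    # Branch to absolute address imm if regs[rd] is non-zero.
    s.pc = imm if s.regs[rd] != 0 else s.pc + 1

def execute_halt(s, rd, rs, imm):
    s.running = False

DECODE_TABLE = {0x1: execute_addi, 0x2: execute_add, 0x3: execute_subi,
                0x4: execute_bnez, 0xF: execute_halt}

def decode(word):
    """Split an encoded word into (semantic function, rd, rs, imm)."""
    return (DECODE_TABLE[(word >> 12) & 0xF],
            (word >> 8) & 0xF, (word >> 4) & 0xF, word & 0xF)

def run(s):
    decoded = {}  # pc -> decoded instruction, reused on later visits
    while s.running:
        # An RPython interpreter would typically place its JIT hint at the
        # top of this loop; omitted so the sketch runs under plain Python.
        pc = s.pc
        if pc not in decoded:
            decoded[pc] = decode(s.mem[pc])   # fetch + decode once per pc
            s.num_decodes += 1
        fn, rd, rs, imm = decoded[pc]
        fn(s, rd, rs, imm)                    # execute
    return s

def encode(op, rd=0, rs=0, imm=0):
    return (op << 12) | (rd << 8) | (rs << 4) | imm

program = [encode(0x1, rd=1, rs=0, imm=5),   # 0: r1 = r0 + 5   (loop counter)
           encode(0x2, rd=2, rs=1),          # 1: r2 = r2 + r1
           encode(0x3, rd=1, rs=1, imm=1),   # 2: r1 = r1 - 1
           encode(0x4, rd=1, imm=1),         # 3: if r1 != 0 goto 1
           encode(0xF)]                      # 4: halt
final = run(State(program))
print(final.regs[2], final.num_decodes)
```

The loop body executes five times, yet each of the five instructions is decoded only once, which is the kind of overhead that caching translated code amortizes. Keeping each instruction's semantics in its own small function is also what makes adding or instrumenting an instruction a local change in this sketch.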

In the remainder of this chapter, Section 3.2 provides an introduction to the Pydgin ADL and Pydgin framework, Section 3.3 explains the JIT annotations used in Pydgin to create a high-performance ISS, Section 3.4 provides examples of Pydgin productivity, Section 3.5 highlights the performance of Pydgin, and Section 3.7 concludes.
