2.5 Hardware acceleration technologies

2.5.1 Central Processing Units

The modern conception of a CPU as a stored-program computer was first described by John von Neumann [85]. The fundamental ideas about how a CPU operates have remained largely unchanged since then, although various CPU architectures have been proposed.

Figure 2.7 provides a high-level abstraction of the architectural parts of a typical CPU. It consists of:

1. An arithmetic logic unit (ALU), which performs arithmetic and logic operations. These operations are encoded as instructions. Most CPU architectures have fixed instruction sets and all high-level code has to be translated and decomposed into these instructions.

2. A set of registers from which the ALU reads its inputs and writes its outputs.

3. A control unit that fetches instructions from the program memory, decodes them and then feeds them to the ALU for execution. This process involves the activation of various control signals and interactions between the necessary parts of the ALU, registers and other components.
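The interaction of the three parts listed above can be sketched as a toy fetch-decode-execute loop. This is a minimal illustration, not a real instruction set: the tuple-based instruction format, the opcode names (ADD, SUB, AND, HALT), the register names and the `run` function are all assumptions introduced here.

```python
# Toy stored-program machine (illustrative only; hypothetical instruction set).
# Each instruction is a tuple: (opcode, destination, source1, source2).

def run(program, registers):
    """Execute `program` until HALT and return the register file."""
    # The "ALU": a fixed set of arithmetic and logic operations.
    alu = {
        "ADD": lambda a, b: a + b,
        "SUB": lambda a, b: a - b,
        "AND": lambda a, b: a & b,
    }
    pc = 0  # program counter, maintained by the control unit
    while True:
        instruction = program[pc]                # fetch from program memory
        opcode, dest, src1, src2 = instruction   # decode into its fields
        pc += 1
        if opcode == "HALT":
            return registers
        # Execute: the ALU reads its operands from registers
        # and writes its result back to a register.
        registers[dest] = alu[opcode](registers[src1], registers[src2])


program = [
    ("ADD", "r2", "r0", "r1"),   # r2 = r0 + r1
    ("SUB", "r3", "r2", "r0"),   # r3 = r2 - r0
    ("HALT", None, None, None),
]
result = run(program, {"r0": 5, "r1": 7, "r2": 0, "r3": 0})
print(result)  # {'r0': 5, 'r1': 7, 'r2': 12, 'r3': 7}
```

Real control units drive hardware control signals rather than dictionary lookups, but the fetch, decode and execute ordering is the same.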

The CPU also communicates with the main (off-chip) memory through a memory interface module. Modern CPUs are based on the same principles, but they are much more sophisticated in how they process instructions and access memory. They are equipped with complex hardware components to optimize the execution of sequential code. For example, they use instruction-level parallelism and out-of-order execution to reduce the execution time of successive sequential instructions, and speculative execution and branch prediction to improve concurrency and handle branching statements more efficiently.
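As one illustration of branch prediction, the sketch below simulates a 2-bit saturating counter, a common textbook scheme. The text does not say which predictor any particular CPU uses, so the scheme, the `count_correct_predictions` function and the loop-branch pattern are assumptions made for illustration.

```python
def count_correct_predictions(outcomes):
    """Count correct predictions for a sequence of branch outcomes.

    `outcomes` holds True for a taken branch and False for a not-taken one.
    """
    # Counter states 0 and 1 predict "not taken"; 2 and 3 predict "taken".
    counter = 2  # assumed starting state: weakly taken
    correct = 0
    for taken in outcomes:
        prediction = counter >= 2
        if prediction == taken:
            correct += 1
        # Move one step towards the observed outcome, saturating at 0 and 3.
        counter = min(counter + 1, 3) if taken else max(counter - 1, 0)
    return correct


# A loop branch that is taken 9 times and then falls through, repeated 3 times.
loop_branch = ([True] * 9 + [False]) * 3
print(count_correct_predictions(loop_branch), "of", len(loop_branch))
```

On this pattern the predictor is wrong only at each loop exit, giving 27 correct predictions out of 30.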

[Figure 2.7 diagram: instruction fetcher, instruction decoder, registers, ALU, memory interface]

Figure 2.7: High-level abstraction of a typical CPU. The instruction fetcher reads instructions from main memory, the instruction decoder sends the necessary control signals to implement the instruction, and the ALU executes the instruction, using inputs from the registers as operands and writing the results back to the registers or the main memory.

Apart from the above techniques, CPU architectures make extensive use of multiple layers of cache memory in order to reduce memory latency (which is large when accessing off-chip memory). The idea behind on-chip cache hierarchies is that a small, fast cache is placed close to the ALU in order to achieve low latency, a larger, slower cache is placed farther from the ALU, and so on. While CPU registers can typically be accessed in one clock cycle, access times for caches range up to 10 or more cycles and the access time for off-chip memory is in the range of 100-200 clock cycles. This cache model is critical for performance in modern CPUs. A large share of the transistors in a CPU is used to implement large caches and to increase their performance through complex techniques that exploit the spatial and temporal locality of memory accesses.
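To see why the cache hierarchy matters so much, the expected cost of a memory access can be estimated from per-level hit rates and latencies. The sketch below is a back-of-the-envelope model: the latencies are picked from the ranges quoted above, while the hit rates (90% and 80%) and the function name `average_access_cycles` are assumptions chosen only for illustration.

```python
def average_access_cycles(levels, memory_cycles):
    """Expected cycles per access for a cache hierarchy.

    `levels` is a list of (hit_rate, access_cycles) pairs ordered from the
    cache nearest the ALU to the farthest; `memory_cycles` is the latency
    of off-chip memory.
    """
    expected = 0.0
    reach = 1.0  # fraction of accesses that get as far as this level
    for hit_rate, cycles in levels:
        expected += reach * cycles   # every access reaching a level pays its latency
        reach *= 1.0 - hit_rate      # only the misses continue to the next level
    return expected + reach * memory_cycles


# Assumed figures: a 2-cycle near cache hitting 90% of the time, a 10-cycle
# far cache hitting 80% of the remainder, and 150-cycle off-chip memory.
with_caches = average_access_cycles([(0.9, 2), (0.8, 10)], 150)
without_caches = average_access_cycles([], 150)
print(f"{with_caches:.1f} cycles with caches, {without_caches:.1f} without")
```

Under these assumed figures the average access costs about 6 cycles instead of 150, which is why so much of the transistor budget goes to caches.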

The sequential programming model and its decline

Since the first commercially available CPUs were released by Intel in 1971, and especially after 1974 (the year the Intel 8080 was made available), the CPU market has grown at an impressive rate, and CPUs can now be found in almost every digital device on the planet. The success of CPUs is due to their flexibility and to their constant (until recently) increase in performance. Flexibility refers to their ability to run any sequential software code, which a compiler decomposes into simple generic instructions that the CPU can execute. This has led to a massive code base that can easily be run on newer CPU generations. The increase in CPU performance is related to the huge innovations in integrated circuit technology over the last decades. This is frequently connected to Gordon Moore’s prediction from 1965, which has since been known as Moore’s law: “The amount of transistors in a given amount of silicon will approximately double every 18 to 24 months” [86].
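The compounding implied by this doubling rule is easy to underestimate. As a rough illustration, the sketch below computes the growth factor after a given number of years for both ends of the quoted 18 to 24 month range; the function name `transistor_growth` and the ten-year horizon are assumptions chosen for the example.

```python
def transistor_growth(years, doubling_months):
    """Growth factor in transistor count after `years`, assuming the count
    doubles every `doubling_months` months."""
    return 2 ** (12 * years / doubling_months)


slow = transistor_growth(10, 24)   # doubling every 24 months
fast = transistor_growth(10, 18)   # doubling every 18 months
print(f"After 10 years: {slow:.0f}x to {fast:.0f}x more transistors")
```

Over ten years this gives a factor of 32 at the slow end and roughly 100 at the fast end.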

All these extra transistors have traditionally been used to enhance the various hardware modules and cache memories inside the CPU in order to increase performance, e.g. through improvements in the ALU pipeline, hyper-threading, speculative execution, and larger and more capable caches. Moreover, the simultaneous increase in transistors’ switching speed has constantly pushed clock frequencies up. These two elements meant that programmers did not have to care about the capabilities of the underlying hardware; increasing frequencies allowed the same sequential program to run faster on newer CPUs. Therefore, developers only needed to wait one or two years for the next CPU model to be released in order to cover the increased computational demands of applications.

This “free lunch” ended in 2004 because integrated circuit technology hit a “power wall” [87, 88]. It is no longer possible to increase clock frequencies while keeping the power envelope of the CPU at safe levels. This is due to limitations in transistor technology, i.e. the fact that leakage power in transistors increases to unsustainable levels as feature size drops. Once feature sizes reached 90 nm, leakage became significant and it was no longer straightforward to dissipate the generated heat from the chip. At that point, the industry had to stop relying on increasing frequencies and instead find other ways to improve performance.

In document Algorithms and architectures for MCMC acceleration in FPGAs (Page 61-63)