
CISC architectures have evolved since the 1970s. They have a compact ISA, and there is a close semantic mapping between CISC instructions and C programming constructs; for this reason, executables tend to be very compact. A driver for CISC is to encapsulate complexity at the processor level. CISC architectures often support advanced features to improve ILP, for example superscalar execution units.

The philosophy of RISC is to have simpler processors; this can be observed in the primitive load-store ISA. The premise is that removing complex and infrequently used instructions allows the processor to be made more efficient. However, a portion of this complexity is transferred to the compiler and software runtime system.

ROSC architectures, otherwise known as stack-based architectures, employ one or more on-chip LIFO stacks instead of a random access register set. Stack machines tend to have simpler hardware and attributes that are advantageous for embedded real-time systems, Koopman [21]. Stack machines have been implemented both in silicon and as abstract machines, for example p-code [40], Forth [19] and the Java Virtual Machine (JVM) [24]. Table 2.4 illustrates the main differences between CISC, RISC and ROSC architectures.

CISC                            | RISC                                | ROSC
--------------------------------+-------------------------------------+-----------------------------------------
Sophisticated instructions      | Primitive instructions              | Sophisticated instructions
Deep pipelines                  | Deep pipelines                      | Shallow pipelines
Compact code                    | Verbose code                        | Compact code
ISA close to programmer’s model | ISA far from programmer’s model     | ISA far from programmer’s model
Easy to program using ASM       | More difficult to program using ASM | Even more difficult to program using ASM
Multi-cycle instructions        | Mostly single-cycle instructions    | Multi-cycle instructions
Specialised registers           | Many general purpose registers      | Restrictive on-chip stack

Table 2.4: Summary of differences between CISC, RISC and ROSC
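The stack-based (ROSC) execution model summarised above can be illustrated with a minimal sketch of an abstract stack machine. The opcode names (`push`, `add`, `sub`, `mul`) and the function `run_stack_machine` are assumptions chosen for illustration only; they do not correspond to p-code, Forth or JVM bytecode.

```python
def run_stack_machine(program):
    """Evaluate (opcode, operand) pairs on a single LIFO data stack.

    Hypothetical instruction set for illustration: operands are taken
    implicitly from the top of the stack rather than from named registers.
    """
    stack = []
    for opcode, operand in program:
        if opcode == "push":
            stack.append(operand)
        elif opcode in ("add", "sub", "mul"):
            # Binary operations pop two operands and push one result.
            right = stack.pop()
            left = stack.pop()
            if opcode == "add":
                stack.append(left + right)
            elif opcode == "sub":
                stack.append(left - right)
            else:
                stack.append(left * right)
        else:
            raise ValueError(f"unknown opcode: {opcode}")
    return stack


# (2 + 3) * 4 expressed in postfix order.
example_program = [
    ("push", 2),
    ("push", 3),
    ("add", None),
    ("push", 4),
    ("mul", None),
]
final_stack = run_stack_machine(example_program)
```

Because operands are implicit, the arithmetic instructions carry no register fields, which is one reason stack code tends to be compact.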

Increasing ILP and clock frequency is becoming very challenging for processor designers. This is mainly due to maturing designs, unacceptable power consumption and heat dissipation. It has been suggested that performance can instead be improved by employing multiprocessors for explicit parallelism. Amdahl’s Law bounds the speedup achievable for a given program in terms of the fraction of its work that can be parallelised [36]. However, the practicalities of parallel computation mean that even Amdahl’s Law is optimistic.
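As a worked illustration of Amdahl’s Law, the following sketch computes the ideal speedup 1 / ((1 - p) + p / n) for a program whose parallelisable fraction is p when run on n processors. The function name `amdahl_speedup` and the example figures are assumptions for illustration; overheads such as communication are deliberately ignored, which is why the result is an upper bound.

```python
def amdahl_speedup(parallel_fraction, processors):
    """Ideal speedup under Amdahl's Law (no parallelisation overheads)."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if processors < 1:
        raise ValueError("processors must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / processors)


# Illustrative figures: 90% of the work parallelises.
speedup_8 = amdahl_speedup(0.9, 8)        # roughly 4.7
speedup_1024 = amdahl_speedup(0.9, 1024)  # approaches, but never reaches, 10
```

Even with a very large number of processors, the speedup in this example stays below 10, because the 10% serial portion is unaffected by adding processors.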

Dataflow architectures are used to increase the granularity of parallelism to the instruction level. A dataflow program is one in which the ordering of operations is implied by the data dependencies rather than by explicit control flow. A dataflow architecture employs direct instruction communication. Programs can be expressed using a graphical notation or, more simply, using trees. The Manchester Dataflow Computer and the more recent TRIPS processor have been introduced.

HLLCAs directly support the execution of high-level programming languages. HLLCAs can be regarded as a subset of LSPs, since they are usually designed to execute a specific high-level language. However, HLLCAs have been heavily criticised by Ditzel and Patterson [35].
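The dataflow principle described above, in which an operation executes as soon as its operands are available rather than in program order, can be sketched as follows. The graph encoding and the function `run_dataflow` are assumptions for illustration; the sketch does not model the Manchester Dataflow Computer or TRIPS.

```python
import operator


def run_dataflow(graph, inputs):
    """Fire each node once all of its operands have values.

    graph maps a node name to (function, [operand names]); inputs maps
    input names to values. Returns all values and the firing order.
    """
    values = dict(inputs)
    pending = dict(graph)
    firing_order = []
    while pending:
        # Every node in `ready` is independent and could fire in parallel.
        ready = [name for name, (_, operands) in pending.items()
                 if all(operand in values for operand in operands)]
        if not ready:
            raise ValueError("graph has unsatisfiable dependencies")
        for name in ready:
            function, operands = pending.pop(name)
            values[name] = function(*(values[o] for o in operands))
            firing_order.append(name)
    return values, firing_order


# (a + b) * (c - d); "result" is listed first but cannot fire first.
example_graph = {
    "result": (operator.mul, ["sum", "diff"]),
    "sum": (operator.add, ["a", "b"]),
    "diff": (operator.sub, ["c", "d"]),
}
values, firing_order = run_dataflow(
    example_graph, {"a": 1, "b": 2, "c": 7, "d": 3})
```

The order in which nodes are written has no effect on the result; only the data dependencies determine when each node fires.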

LSPs may be regarded as a more general form of HLLCA. LSPs usually refer to a software interpreter and runtime system or JIT compiler targeted at a specific language. Many LSPs are interpreters or virtual machines, Smith and Nair [32]. They are often implemented as stack machines.

An SDLSP is an LSP whose architecture follows the grammar rules of the language. The instructions are simple encodings of the source language.

The memory hierarchy includes main memory, caches and stack management components. Caches are used to reduce the adverse effects of the disparity in speed between the CPU and main memory by taking advantage of spatial and temporal locality, Hennessy and Patterson [8]. If the ways of a set are full for a given address, then an appropriate cache line must be evicted to make space; the cache line to evict depends on the replacement policy. Cache modelling can be used to predict cache behaviour, though it is generally regarded as NP-hard. An alternative to speculative caches is scratch pad memory, which may also help reduce non-determinism. The memory architecture of a multiprocessor system can be organised in a variety of ways, including UMA/SMP and NUMA (distributed memory). Multiple copies of the same data may be held in multiple caches. So that all processors see the same value, cache coherency protocols must be implemented by the caches. Cache coherency for UMA/SMP systems uses snooping, whereas cache coherency for NUMA uses distributed directory-based approaches, Hennessy and Patterson [8]. Practical implementations tend to be very complicated. Stack spilling and restoration are an aspect of the memory hierarchy for stack-based processors. A number of approaches can be used for managing this, including sizing the stack for the worst case depth, demand-fed, paging, caching, function stack, block stack and scratch pad memory, Koopman [21].
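The eviction behaviour of a set-associative cache described above can be sketched with a small model. The class `SetAssociativeCache`, its parameters and the choice of a least-recently-used (LRU) replacement policy are assumptions for illustration; the text does not prescribe a particular policy, and real caches often use cheaper approximations.

```python
from collections import OrderedDict


class SetAssociativeCache:
    """Toy set-associative cache with LRU replacement (tags only, no data)."""

    def __init__(self, num_sets, ways, line_size):
        self.num_sets = num_sets
        self.ways = ways
        self.line_size = line_size
        # One ordered dict per set; the oldest entry is least recently used.
        self.sets = [OrderedDict() for _ in range(num_sets)]
        self.hits = 0
        self.misses = 0

    def access(self, address):
        """Return True on a hit, False on a miss (filling the line)."""
        line_number = address // self.line_size
        index = line_number % self.num_sets
        tag = line_number // self.num_sets
        cache_set = self.sets[index]
        if tag in cache_set:
            cache_set.move_to_end(tag)  # mark as most recently used
            self.hits += 1
            return True
        self.misses += 1
        if len(cache_set) >= self.ways:
            cache_set.popitem(last=False)  # all ways full: evict the LRU line
        cache_set[tag] = None
        return False


# Addresses 0, 32 and 64 all map to set 0 of a 2-set, 2-way cache.
cache = SetAssociativeCache(num_sets=2, ways=2, line_size=16)
trace = [0, 32, 0, 64, 32]
results = [cache.access(address) for address in trace]
```

In this trace the second access to address 0 hits (temporal locality), the access to 64 evicts the line holding 32 because it is the least recently used, and the final access to 32 therefore misses.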

The fundamental ISA and architecture of a processor and its surrounding subsystems dictate, to some extent, the basic performance ultimately achievable. In order to optimise performance, designers employ techniques to improve the common case, for example pipelining, superscalar execution and caching.

Microprocessor vendors are naturally reluctant to change a processor ISA for commercial reasons; doing so could undermine the entire product base. Overhauling or even simply modifying it would have massive repercussions for the processor’s ecosystem. Modifying the ISA would impact the compiler back-end, including machine-specific optimisers and code generators. Other toolchain components such as linkers, loaders, romisers, assemblers, debugger back-ends, profilers and memory checking tools would all be significantly impacted. As a result, vendors may have been hesitant to address the problems and improve ISAs.

Chapter 3

Problem Analysis

This chapter outlines the problems that the thesis aims to address. Firstly, the motives for multiprocessor systems are discussed. The motives are driven by the current problems faced by uniprocessor systems. The challenges faced by multiprocessor systems are then presented. In order to address the challenges, it is necessary to understand where improvements can be made in CPU design. These improvements are focused on considering fundamental changes rather than optimising the status quo. In order to understand where fundamental improvements can be made, it is first necessary to understand what CPUs spend most of their time and energy doing. This understanding can be gained by reviewing previous literature and using benchmarks to project energy estimates on the results. After reviewing the initial findings, this chapter finishes by suggesting a possible alternative CPU ISA to help address the challenges.