α-Coral: A Multigrain, Multithreading Processor Architecture*


Mark N. Yankelevsky Constantine D. Polychronopoulos

Center for Supercomputing Research and Development

University of Illinois at Urbana-Champaign 1308 W. Main St.

Urbana, IL 61801


ABSTRACT

Recently popularized hardware multithreading (HMT) architectures, such as SMT, Multiscalar, and Tera, do not provide flexible and efficient methods of thread management and synchronization in hardware. The α-Coral architecture is a tool for investigating a more dynamic approach to thread management. Unlike other architectures, it imposes no strict requirements on the timing and size of threads, and no static partitioning of resources. α-Coral provides a simultaneous multiprogramming and multithreading environment, which is mostly managed in hardware. To other architectures, α-Coral adds on-demand register allocation, fast creation and destruction of variable-size threads, and quick synchronization through a shared register file. While other architectures attempt to port existing compilers, the α-Coral architecture is supported by a custom compiler system. This system provides a simple method of mapping the hierarchical internal representation of a program to variable-size threads.

This paper examines a new approach to hardware multithreading, involving minimal extensions to the instruction set of conventional RISC superscalar architectures. The α-Coral architecture and compiling support introduce a multi-grain multithreaded architecture which extends wide-superscalar processor cores to support hierarchical multithreading. A simulator was developed and results are presented to demonstrate the feasibility of our design approach.

KEYWORDS

Multithreading, processor architecture, parallelizing compiler

1. INTRODUCTION

In a typical processor the state of an execution stream is contained in a program counter (PC) and registers (RF), while the execution units perform computation. When multithreaded processors were proposed, the goal was to share the execution units among the multiple execution streams to achieve higher utilization.

However, the program counter and register file structures are not shared. The existing multithreaded processors are designed to contain a fixed number of threads. The state of each thread (PC and RF) is reserved for threads even before the thread is created.

The advantage of multithreaded execution comes from the ability to switch or intermix threads to keep execution units busy.

Therefore, availability of more threads would be beneficial to the execution paradigm. However, the number is limited by the state storage. One solution to the problem is increasing the size of the state structures (PC and RF). Yet, this is not a scalable solution, since a program can always introduce more threads than the structures support. Another drawback of this solution is the decreased time-averaged utilization of the larger structures.

The unique architecture of α-Coral satisfies both goals: a relatively large number of threads and sharing of the state structures between these threads. Threads are generated and destroyed through the use of ordinary machine instructions. The state of the currently executing threads is stored in the processor itself. The number of threads is limited only by the size of the structure that holds the program counters. The structure may contain anywhere from one to hundreds of PCs, depending on the implementation. However, unlike in other processors, not all of the PCs are guaranteed a minimum number of registers in the RF to commence execution.

Each thread requires a variable number of registers. This number is not known until execution begins. The advantage of this approach is that a greater number of running threads can be accommodated by the hardware. The disadvantage is that program execution may be hindered by the inability to allocate registers. These issues are addressed in the paper.

The proposed architecture can efficiently handle different types of latency ranging from cache misses and branch mispredictions to exceptions and page faults. In order to allow the processor to make progress in the presence of such a wide range of stall conditions our design provides for the handling of a variety of thread granularities. In order to mask short latencies, the thread switching time is correspondingly small.

Besides RF and PC sharing, the CPU design incorporates interesting architectural features such as implicit synchronization registers. To fully utilize these features, the PROMIS [8] compiler is used to compile for α-Coral. This paper will briefly discuss the procedures, requirements, and features of the compiler, while focusing mainly on the processor architecture.

Section 2 discusses previous work done in the area. Section 3 gives a general overview of the hardware thread concept in α-Coral. Sections 4 and 5 discuss the architectural aspects and tradeoffs relevant to various implementation options. Design challenges are discussed in Section 6. Experiments and data generated using the simulator are then presented in Section 7 and applied to evaluate certain architectural aspects of α-Coral. Section 8 describes the compiler implementation for α-Coral. Finally, Section 9 summarizes the architecture and results.

* This research was supported by National Science Foundation grant EIA 99-75019, research support from NSA, and a research grant from Intel Corporation.

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers, or to redistribute to lists, requires prior specific permission and/or a fee.

ICS ’01 Sorrento, Italy

2. PREVIOUS WORK

Two main directions in hardware multithreaded implementations were introduced in the past 10 years. The first is a fully customized trend, resulting in supercomputers with a completely new instruction set and redesigned hardware. This approach is costly and the hardware produced is not backward compatible.

The other approach was to reuse existing superscalar designs and extend them to support more than one instruction stream.

Initial fully custom multithreaded systems came from the early designs of fine-grained multithreaded processors, starting with the Denelcor HEP [9], which was later followed by the Horizon [5] and the Tera [2]. The result was processors with poor single-thread performance but a high theoretical maximum throughput. These systems rely completely on latency tolerance for memory accesses due to the lack of caches, so they are limited to specialized applications. MIT Alewife [1] uses another type of multithreading, involving coarse-grained context switching. The processor switches threads on a long-latency cache miss. The high cost of a context switch forces the processor to execute each thread to completion.

The Wisconsin Multiscalar processor design contains duplicate processing units that share the caches and the register files through a network of interconnects. Tasks, which are recognized in control flow graphs, are scheduled by a sequencer to the processing units with the appearance of executing a single application. The Multiscalar design uses information generated by the compiler to extract threads from a single application to be run in parallel. However, it does not provide a multithreaded environment to run different applications simultaneously. In this sense, the multithreading aspect of a Multiscalar design is hidden from the user.

The second direction in hardware multithreading aimed at staying close to the existing superscalar designs and extended them to support more than one instruction stream. α-Coral and the University of Washington’s Simultaneous Multithreading (SMT) project belong to this category. The SMT model relies on the ability to issue instructions from multiple programs simultaneously, hence boosting throughput more than parallelism.

The major design issues that are still considered open research problems include thread context representation, register file design and allocation of registers to threads, thread issue selection policy, thread switch time and methods, communication between threads, and the degree of compiler involvement.

Simultaneous multithreading is different from the Multiscalar design in that even the execution units can be shared between different threads. A simultaneously multithreaded processor can fetch mixed instructions from different threads and issue them to the many execution units in the system. In this sense, there is no cost for context switching. A good single-threaded execution rate can be achieved without modifications to the architecture.

The SMT [11] architecture was initially designed to run up to eight distinct programs on a single superscalar machine. Minimal or no code changes were required, although substantial OS support is assumed. Each thread has its own 32-register context and no communication is necessary. The work concentrated on trade-offs between different thread selection policies and on interconnection constraints between instruction issue and execution ports. The experimental results suggested that the most efficient design held four immediate thread contexts (program counters) in the fetch units, and a partially interconnected port matrix. Smart thread selection schemes were implemented, based on the history-driven probability of stall frequency.

Recent developments in the SMT architecture include improvement in the register file utilization [7]. The approach uses compiler hints to allocate and deallocate each physical register on the fly. Thread communication [12] has been introduced through two new instructions (acquire and release), as well as a locking hardware structure (lock_box). Compiler support was also introduced [6], which relaxes and modifies some of the requirements on code scheduling and data access used by current compilers. Dynamic, but speculative thread creation has also been investigated using the SMT [14].

Unlike simultaneous multithreading, thread selection in α-Coral is only performed during fetch. Once instructions are fetched, they compete freely with each other for execution units regardless of the originating thread. By not using direct assignments of the execution units, much less front-end bandwidth is required to supply the issue unit with as many decoded instructions from alternative threads as possible. The requirement for buffering many different instructions from different threads is reduced.

This simplification reduces the required widths of the fetch and decode units. This brings the requirement closer to that of a single-threaded processor.

3. HARDWARE THREADS IN α-CORAL

To better understand the differences between α-Coral and other architectures, the concept of an α-Coral hardware thread needs to be explained.

The α-Coral architecture uses low-level, fine-grain threads where each thread has its own Program Counter value. Each thread has a private section in a large register file, called a “segment”. An initial thread is loaded into the processor by a boot procedure, and thereafter threads are spawned by the instructions in the currently running threads.

Executing specially designated instructions creates threads. Once a thread is created and begins execution, the processor allocates a segment in the register file for that thread to use. The size of the segment varies depending on the register requirements of the corresponding thread. Integer and floating point register files are handled separately, and the segments are allocated based on immediate need.

Once a thread is created, its instructions are continuously fetched using the given program counter. Certain System instructions appearing in the thread can be used to pause its execution, pending a certain value of a shared register. Paused threads do not give up their program counter or their register file sections. The thread runs until it encounters a System instruction that terminates it.

Three methods are used to communicate between threads. The first is a shared register file: a small file of 8 to 16 registers can provide fast data and control synchronization. The second is conventional load-linked/store-conditional communication through memory. The third is parameter passing at thread creation time.

The instruction set of α-Coral is based on the MIPS IV instruction set, with extensions for the instructions needed to create, destroy, and synchronize threads. The TFORK <targetPC> [DIE] instruction starts a thread at the targetPC address in memory. The current thread can optionally be deallocated by specifying the DIE bit. Another instruction creates multiple threads and is used for direct hardware implementation of parallelizable loops: DOALL <targetPC> <iterR> <constR> starts a number of threads specified by the thread-local register iterR. Each of the created threads receives its serial number in a local register. In addition, the value of the constR register is passed into a predefined local register of each created thread.

The BLOCK <source SR> <target R/C> instruction stops the execution of the current thread until the Shared Register (source SR) is equal to a target value (target R/C). HALT stops the execution of the current thread. The STORES <normal store parameters> <DIE> instruction performs a normal store and, if the DIE bit is set, terminates the current thread. The ADDS/SUBS <dest SR> <source SR> <source R/C> instructions perform atomic operations in the shared register file.

These instructions and their variations transfer the responsibility for thread management from the runtime system to the processor and compiler, thus reducing the granularity of threads. For example, implementing a DOACROSS type loop with multiple threads in α-Coral removes most of the induction variable operations. Barriers and critical sections are implemented using BLOCK and ADDS/SUBS instructions, without calls to the runtime system.
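The barrier use of BLOCK and ADDS can be illustrated with a small software analogy. The Python sketch below is not the hardware mechanism: the class SharedRegisterFile and its methods adds, block, and read are hypothetical names, and a condition variable stands in for the logic that wakes a blocked thread once a shared register reaches the target value.

```python
import threading


class SharedRegisterFile:
    """Software stand-in for the shared register file (hypothetical model)."""

    def __init__(self, size=8):
        self._regs = [0] * size
        self._cond = threading.Condition()

    def adds(self, dest, src, value):
        """Model of ADDS: atomically set regs[dest] = regs[src] + value."""
        with self._cond:
            self._regs[dest] = self._regs[src] + value
            self._cond.notify_all()  # wake any thread waiting in block()

    def block(self, src, target):
        """Model of BLOCK: suspend the caller until regs[src] == target."""
        with self._cond:
            self._cond.wait_for(lambda: self._regs[src] == target)

    def read(self, src):
        with self._cond:
            return self._regs[src]


def run_barrier(num_threads=4):
    """Each worker increments SR0 with ADDS, then BLOCKs until SR0 == num_threads."""
    srf = SharedRegisterFile()
    passed = []

    def worker(tid):
        srf.adds(0, 0, 1)          # announce arrival at the barrier
        srf.block(0, num_threads)  # wait, without spinning, for all arrivals
        passed.append((tid, srf.read(0)))

    workers = [threading.Thread(target=worker, args=(t,)) for t in range(num_threads)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    return sorted(passed)
```

In this model no thread passes the barrier until the shared counter equals the number of participants, and waiting threads sleep rather than poll, which mirrors the stated motivation for BLOCK.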

4. ARCHITECTURAL DESCRIPTION

Unlike SMT, α-Coral targets single-application speedup. Hence several decisions were made about compiler support and instruction set modifications. The base design for α-Coral consists of a wide-issue, out-of-order, five-stage superscalar processor. The stages are Fetch/Decode, Rename, Issue, Execution, and Commit. Instruction execution is performed out of order, using a reorder buffer (Execution Queue) to buffer ready-to-execute instructions.

To this base design, several components are added to support multithreading:

• PC Queue stores the program counters and minimal additional information about the state of the thread.

• System and DOALL units are functional units, which execute thread manipulation instructions added to the ISA.

• Segment Table is added to a large register file to hold the information about the location of thread-private registers.

The following sections will describe the architectural components in greater detail, emphasizing specific multithreading aspects of the design.

4.1 Instruction Fetch Decode Stage

This stage fetches and decodes instructions from instruction streams. Program counters for the streams are stored in the PC Queue. α-Coral does not explicitly “switch threads”; all of the program counters are used selectively and simultaneously to provide workload for the rest of the processor. Therefore, during most execution cycles no context needs to be flushed or reloaded from memory. Complexities of this execution paradigm are discussed in the next section.

The best fetch policy varies depending on the thread mix and the structure of the threads being executed. Each policy has two aspects: which threads to choose and how many instructions to fetch from these threads. The first aspect is extremely important, since the threads may potentially come from the same application and could be waiting for each other’s computational results. The second aspect allows for varying the instruction-level parallelism seen by the execution core, since instructions from separate threads cannot be register dependent on each other. On the other hand, fetching a small number of instructions from each thread could be inefficient, due to a penalty associated with I-Cache performance.

A detailed discussion of the fetch size aspect and how it relates to fetching delays can be found in [11]. However, α-Coral differs from SMT in three respects: the threads are not controlled by a runtime system, there is a larger number of threads, and these threads are produced by the same context.

Branch prediction is performed using a Branch Target Buffer shared between all of the threads. This minimizes branch misprediction rates for threads that are created from the same code, such as loop iteration threads.

4.2 Program Counter Queue

This structure contains the program counters of threads that have been created in the processor. The Program Counter (PC) Queue is initialized with one entry at startup. The queue is then updated with new program counters created by the instructions executed in the System and DOALL functional units. When a thread completes, the corresponding entry in the queue is freed.

Besides the program counter itself, each entry contains a State field and a Segment identifier. The State field contains the information used by the Fetch Unit. The Fetch Unit can skip a thread if the thread is waiting for a memory operation, is suspended due to a BLOCK instruction, or is waiting for a System instruction to execute. The Segment identifier ties the thread to its private register file section.

Threads are created by the System instructions by inserting the targetPC (see Section 3) into the next available location of the PC Queue. DOALL instructions will attempt to insert many program counters into the queue. In either case, the thread containing a System or DOALL instruction does not continue until all of the threads that have to be started are created.
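The behavior of the PC Queue described in this section can be sketched in software. The following Python model is illustrative only: the names PCEntry, PCQueue, tfork, doall, halt, and select_for_fetch are hypothetical, insertion uses the first free slot rather than a true circular tail, and the round-robin fetch choice is an assumed policy, since the best policy is said to depend on the thread mix.

```python
from dataclasses import dataclass
from typing import List, Optional

READY, BLOCKED, WAIT_MEM, WAIT_SYS = "ready", "blocked", "wait_mem", "wait_sys"


@dataclass
class PCEntry:
    pc: int
    state: str = READY
    segment_id: Optional[int] = None  # filled in when a register segment is allocated


class PCQueue:
    def __init__(self, capacity):
        self.slots: List[Optional[PCEntry]] = [None] * capacity
        self._next_fetch = 0

    def tfork(self, target_pc):
        """Insert one thread; return its slot index, or None if no slot is free."""
        for i, slot in enumerate(self.slots):
            if slot is None:
                self.slots[i] = PCEntry(pc=target_pc)
                return i
        return None

    def doall(self, target_pc, iterations):
        """Try to create `iterations` threads; return how many are still pending."""
        pending = iterations
        while pending > 0 and self.tfork(target_pc) is not None:
            pending -= 1
        return pending

    def halt(self, index):
        """Free the entry of a completed thread."""
        self.slots[index] = None

    def select_for_fetch(self):
        """Round-robin over slots, skipping empty slots and threads that are not ready."""
        n = len(self.slots)
        for step in range(n):
            i = (self._next_fetch + step) % n
            entry = self.slots[i]
            if entry is not None and entry.state == READY:
                self._next_fetch = (i + 1) % n
                return i
        return None
```

A nonzero return value from doall corresponds to the case in which the creating thread cannot continue until queue space becomes available.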

4.3 Register Files and Segment Table

The MIPS ISA calls for three register files. The Integer and Floating Point register files are large contiguous register files. They are logically segmented, and the segment ranges are stored in the corresponding Integer and Floating Point Segment Tables. Additionally, the Shared Register File is a standard register file structure provided for thread communication. Register files are accessed with a single index, representing the register number. The segment property does not play a part in the register file access during reading and writing of the data.

The Segment Table is organized as a circular queue that maps threads to the ranges of registers allocated to these threads. Segments are allocated at the tail of the queue, but can be deleted in the middle of the queue. This causes register file fragmentation, which is discussed later in the paper. Each segment entry contains three pieces of information: start value, number of registers, and thread identification. The start value is a pointer into the register file where the first register belonging to the thread is located. The number of registers is the total number of registers used by the thread. The thread identification points to the location of the thread in the PC Queue. Segments are created for new threads in the Rename Stage. An example of a Segment Table is shown in Figure 1 of Section 6.

4.4 Rename Stage

This stage is responsible for translating the logical registers in the decoded instruction to physical register numbers. Physical numbers are derived from the segment information stored in the Segment Table for each thread. Once the register number in the instruction has been renamed, the original number is never referenced. This property allows for free intermixing of the instructions at the execution core.

This stage is also responsible for allocating register file segments to threads that do not have one. The first instruction of a thread that requires an access to a particular type of register file causes a segment allocation in that register file. To make resource distribution more dynamic, the register file segments are not allocated before the thread starts, and they are not allocated when the instructions of the thread are fetched. This prevents unnecessary resource allocation to the thread creation instruction fetched speculatively after a branch.

If an instruction requires a register, but the segment cannot be allocated, the processor stalls the execution of this instruction’s thread. More discussion on resource conflicts is contained in later sections.

The number of registers required for each thread is encoded in the first instruction of the thread. For example, an instruction ADD.D F7, F4, F1, where F7 is the destination register, arrives at the rename stage from thread X. The register allocation logic will query the Segment Queue of the Floating Point Register File for an offset. If thread X has not executed any instructions requiring the Floating Point Register File, the query will fail. The failure will trigger allocation of an eight-register segment in the Floating Point Register File. The number eight comes from the fact that F7 was the register used as the destination in the first floating point instruction of thread X. Therefore, thread X will require use of registers F0 through F7.
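The allocation-on-first-use behavior in the ADD.D example can be sketched as follows. This Python model is a simplification under stated assumptions: the class SegmentTable and its rename method are hypothetical, segments are handed out by simple bump allocation with no wraparound or reuse, and the segment size is taken to be the destination register number of the first instruction plus one, as in the example.

```python
class SegmentTable:
    """Hypothetical model of the segment table for one register file."""

    def __init__(self, rf_size):
        self.rf_size = rf_size
        self.next_free = 0   # bump allocation only; ignores wraparound and reuse
        self.segments = {}   # thread id -> (start, count)

    def rename(self, thread_id, logical_regs):
        """Map logical register numbers to physical register numbers.

        `logical_regs` lists the destination first, as in ADD.D F7, F4, F1.
        Returns None when a segment is needed but cannot be allocated,
        which corresponds to stalling the thread.
        """
        if thread_id not in self.segments:
            count = logical_regs[0] + 1  # F7 as destination -> F0..F7
            if self.next_free + count > self.rf_size:
                return None
            self.segments[thread_id] = (self.next_free, count)
            self.next_free += count
        start, count = self.segments[thread_id]
        if any(r >= count for r in logical_regs):
            raise ValueError("register outside the thread's segment")
        return [start + r for r in logical_regs]
```

For instance, with a 16-register file, thread X issuing ADD.D F7, F4, F1 receives physical registers 0 through 7, and a later thread whose first destination is F3 receives a four-register segment starting at physical register 8.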

4.5 Issue and Register Read Stage

This stage reads the data from the register file. If all of the required data is available, the stage issues the instruction into the Execution Queue. Data forwarding logic is used to resolve register data dependencies between non-speculatively fetched instructions from the same thread. If the dependence cannot be resolved, the instruction’s thread is stalled. A stalled thread does not prevent other threads from issuing.

4.6 Execution Queue and Functional Units

The Execution Queue is a circular queue, either unified or separated, which provides the functional units with work. Most of the units are Load/Store, Integer, and Floating Point Arithmetic units. System Units deal with thread control instructions, such as DOALL, TFORK, and BLOCK. The unified version of this queue suffers from a fragmentation problem, but provides a uniform approach to branch misprediction handling and thread control.

Execution of the instruction occurs out of order. Once the instruction is executed it can be retired by the commit stage.

Thread creation is fast, since it only involves entering the program counter value and parameter information into the PC Queue. The DOALL and TFORK instructions are not retired until they can successfully create the assigned number of threads. A stalled System instruction could indefinitely prevent thread execution progress. This problem is discussed in Section 6. The number of threads that can be started in a particular execution cycle is limited.

When a HALT or another thread-ending instruction is encountered, the System unit deallocates the corresponding thread. This operation is fast, since it only requires marking the PC Queue and register file segment entries as unused.

4.7 Commit Stage

The instructions are committed in order for each thread. This stage removes the completed instructions from the Execution Queue and writes the results back to the register file. It also updates the branch prediction information in the BTB and the state information in the PC Queue.

5. ARCHITECTURAL DECISIONS

The distinguishing features of α-Coral are thread control instructions and a segmented register file. This section examines their benefits and drawbacks.

The most significant decision made in the α-Coral design was adding new system instructions, and therefore relying on the compiler, to isolate the threads for the processor. This decision corresponds with the recent transition from purely hardware solutions to compiler/instruction set solutions in industry. The addition of threading instructions to the instruction set allows the operating system to easily wrap existing applications using several thread instructions and execute these applications without recompilation. In addition, single-threaded applications can be executed in parallel with other programs. When a single-threaded application is executed standalone on an α-Coral type processor, it uses all of the available hardware resources, including the enlarged register files, expanded reorder buffer, and increased number of functional units.

From the point of view of the compiler community, α-Coral will open many new directions. Automatic parallelization is a well-researched area of compilers that has been limited by the fact that thread synchronization between processors is slow and has virtually unbounded delay, since it depends on memory access instructions.

Register file segmentation also requires compiler support, but it has the potential to reduce register pressure because it allows the compiler to access a greater number of registers directly. Ordinarily, the compiler can only access ISA-defined registers, and the number of addressable registers is limited by the size of the instruction word. In order to provide more registers, the ISA must be completely modified. Since α-Coral creates a separate segment for each thread, the compiler is able to manage more registers than each instruction in the ISA can address. This design also allows an existing architecture to be restructured in order to expose the hidden renaming registers to the compiler.

Register file segmentation eliminates the need for traditional hardware register renaming, thus providing faster register file access. Hardware renaming is replaced by compiler register renaming and allocation, which works with a much larger instruction window. Studies [13] have shown that renaming engines employed by current out-of-order processors actually interfere with the compiler renaming optimizations. Thread-level register renaming was investigated early in the project [10]. However, it showed insignificant speedup over segmentation.

The majority of the enhancements to the instruction set are for thread generation. This is beneficial to both the compiler and the operating system. A compiler such as PROMIS [8] that supports hierarchical parallelism in the internal representation can use the TFORK and DOALL instructions to directly map program sections into hardware threads. With hardware scheduling of low-level threads, the operating system can concentrate on managing larger, process-level threads.

The shared register file provides an alternative to a “lock box” [12] scheme when implementing atomic operations. In addition to location locking, the shared register file allows for simpler memory synchronization primitives like barriers. By using a BLOCK instruction on a shared register, spin waiting is avoided. However, a drawback of shared register synchronization is poor code scalability for multiprocessors.

6. DESIGN CHALLENGES

Hardware multithreading designs based on superscalar architectures involve many unexpected challenges. Usually, these issues deal with thread switching time and with synchronization delay. In α-Coral, the most difficult issues are thread management and resource allocation. These issues are similar to problems encountered in operating system level thread management. However, a hardware implementation does not have a large pool of resources, such as virtual memory and a hard disk drive. In addition, hardware threads cannot be easily swapped out to memory.

To solve resource conflict and deadlock problems between hardware threads, the designer can create rules for the compiler to follow when creating multithreaded code. On the hardware side, some of the high-cost operations can be performed infrequently, which would make their amortized cost small. Another method used to deal with hardware resource deadlocks is to keep the threads that will not have enough resources from entering the processor, as well as deleting threads that encounter resource conflicts.

6.1 Dynamic resource allocation

In the early stages of the design, the availability of resources was verified exclusively at allocation time (Rename Stage). To reach the Rename Stage, a thread consumes two fetch cycles. A lack of resources would then cause execution to stall, because the front end of the processor was filled with threads that could not continue. A decision was made to delete the instructions fetched from these threads in order to allow threads without new resource requirements to progress.

However, the problem was not solved. In certain fetch policy configurations, the threads without resources would be fetched and their instructions deleted continuously, without allowing other threads to issue. The logical solution was to mark the threads that could not allocate resources with a particular PC Queue state.

Marking the threads seemed to alleviate most of the stalls, since the fetch unit was able to prefer threads with resources to threads without resources. However, the threads without resources still need to be issued at some point in time in order to allocate the required resources. Therefore a flag was added to the processor, which allowed processing of threads previously marked as failed.

When a thread deallocates a resource, such as a register file segment, this flag is set, allowing previously failed threads to issue. When many threads fail at resource allocation, the flag is reset, thereby providing full fetch bandwidth to threads with resources.
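The interaction between the per-thread failure mark and the processor-wide flag can be sketched in a few lines. This Python model is a simplification: the class FetchGate and its methods are hypothetical, the threshold fail_limit is an assumed value standing in for "many threads fail", and failed threads are excluded outright when the flag is clear, whereas the text describes the fetch unit as preferring threads with resources.

```python
class FetchGate:
    """Hypothetical model of the flag that lets failed threads retry allocation."""

    def __init__(self, fail_limit=4):
        self.fail_limit = fail_limit  # assumed threshold for "many threads fail"
        self.allow_failed = False
        self.recent_failures = 0

    def on_deallocate(self):
        """A resource (e.g. a register segment) was freed: let failed threads retry."""
        self.allow_failed = True
        self.recent_failures = 0

    def on_allocation_failure(self):
        """Count failures; after too many, return fetch bandwidth to other threads."""
        self.recent_failures += 1
        if self.recent_failures >= self.fail_limit:
            self.allow_failed = False

    def eligible(self, threads):
        """threads: list of (thread_id, marked_failed). Return ids that may be fetched."""
        return [tid for tid, failed in threads if self.allow_failed or not failed]
```

A deallocation opens the gate for every marked thread at once, and repeated allocation failures close it again so that threads holding resources regain the fetch bandwidth.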

6.2 Circular queue fragmentation

The PC Queue, Execution Queue, and Segment Tables are required by the design to contain ordered data. Insertion into these queues must be performed at the tail, but deletion may occur in the middle. This creates a fragmentation situation in which the head and the tail of the queue meet, causing the queue to become unusable even though it is not full. The situation prevents allocation of resources to existing threads as well as creation of new threads. This, in turn, causes the processor to stall. A simple solution is to compress the queue by moving queue members closer together, thereby freeing additional resources. However, this is a very expensive operation to perform in hardware. Also, compression requirements may render the design unimplementable, since the positions in the queues are required to act as identifiers in other processor structures. Another solution is to perform a quick, partial compression, which cleans up only the tail and the head of the queue. This solution is implemented differently for the Execution Queue and the Segment Queues.

[Figure 1. Register File Fragmentation. (a) Truly Full Segment Table: every slot between Head and Tail holds a register range (2-9, 10-21, 22-37, 38-69, 70-89, 90-93, 94-101, 102-121, 122-1). (b) Full Segment Table, where some slots are empty: only the ranges 2-9, 10-21, and 22-37 remain allocated. (c) Allocation of a special slot: as in (b), with one empty slot marked SPECIAL.]

The BLOCK and DOALL instructions are handled by special structures in this design. In an ordinary out-of-order execution processor, reorder buffer slots will be freed in a finite amount of time, as soon as the functional unit is finished with the instruction. For thread management instructions, this time is indeterminate. In some cases the resources locked by the BLOCK operation prevent another thread from executing its instructions, which would unblock the thread. To solve this problem, BLOCK instructions are taken out of the Execution Queue and placed into a separate structure. This structure snoops on the shared register file and unlocks the Program Counter when the block condition is satisfied. The same is done to prevent the DOALL instructions, which are not able to start the assigned number of threads, from polluting the Execution Queue. The DOALL unit holds the pending DOALL instructions and services them as space in the PC Queue becomes available.

If the Execution Queue does not progress for several hundred cycles, a watchdog counter signals a partial compression. Partial compression does not move any items in the queue, but attempts to deallocate empty entries starting from the tail and the head of the queue.
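Partial compression can be sketched in Python as follows. This is an illustrative model only: the `RingQueue` class, its fields, and the representation of a freed entry as `None` are assumptions, and the watchdog counter that triggers the operation is not modelled.

```python
from typing import List, Optional


class RingQueue:
    """Fixed-size circular queue whose slot indices act as identifiers."""

    def __init__(self, size: int) -> None:
        self.slots: List[Optional[str]] = [None] * size
        self.head = 0   # index of the oldest slot inside the occupied span
        self.span = 0   # slots between head and tail, holes included

    def push(self, item: str) -> Optional[int]:
        """Allocate at the tail; return the slot index, or None if no room."""
        if self.span == len(self.slots):
            return None
        idx = (self.head + self.span) % len(self.slots)
        self.slots[idx] = item
        self.span += 1
        return idx

    def free(self, idx: int) -> None:
        """Mark a slot empty without shrinking the span (creates a hole)."""
        self.slots[idx] = None

    def partial_compress(self) -> int:
        """Deallocate empty slots at the head and the tail; move nothing."""
        reclaimed = 0
        size = len(self.slots)
        while self.span and self.slots[self.head] is None:
            self.head = (self.head + 1) % size
            self.span -= 1
            reclaimed += 1
        while self.span and self.slots[(self.head + self.span - 1) % size] is None:
            self.span -= 1
            reclaimed += 1
        return reclaimed
```

Holes in the middle of the span are left untouched, so surviving entries keep their positions and any structure that uses those positions as identifiers remains valid.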

The segment queue presents a more complicated problem. As shown in Figure 1b, the queue could have only two entries, one at the tail and one at the head, while all other entries have been freed. As mentioned earlier, new allocation is done at the tail of the queue. This case threatens an execution lock, since newly created threads are not able to allocate their registers. This case is treated in two phases. First, the instructions of the failed thread are deleted from the Fetch and Rename stages, and the number of registers requested by that thread is stored. Potentially this step will allow other threads to complete and deallocate the necessary registers. In phase two of the resolution, a process is started that searches for a contiguous region in the register file and the segment queue whose size is equal to the number of registers needed by the last thread that failed in segment allocation. To avoid deadlock, if the deletion occurs more than ten times in a row, the watchdog triggers the Rename stage to assign this special segment to the failing thread. The cost of the gap-searching operation is amortized, since it is not used often.
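The gap search of phase two can be sketched as a first-fit scan over an allocation bitmap. The function names, the bitmap representation, and the first-fit order are assumptions made for illustration; the sketch also ignores wraparound of the register file and the matching search in the segment queue.

```python
from typing import List, Optional

RETRY_LIMIT = 10  # from the text: more than ten deletions in a row


def find_gap(allocated: List[bool], needed: int) -> Optional[int]:
    """Return the start of the first run of `needed` free registers, or None."""
    if needed <= 0:
        return None
    run_start, run_len = 0, 0
    for i, busy in enumerate(allocated):
        if busy:
            # An allocated register ends the current free run.
            run_len = 0
            run_start = i + 1
        else:
            run_len += 1
            if run_len == needed:
                return run_start
    return None


def watchdog_fires(consecutive_deletions: int) -> bool:
    """True when the failing thread should be given the special segment."""
    return consecutive_deletions > RETRY_LIMIT
```

For example, with registers 0, 3, and 7 allocated in an eight-register file, a request for three registers is satisfied by the run starting at register 4, while a request for four registers fails.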

A similar technique is used to deal with PC Queue fragmentation. If an insertion into the PC Queue fails a preset number of times in a row, compression is done at the head and the tail of the queue. In case of further failures, a process is started to seek out an empty slot in the queue. That slot is then assigned to the thread-creating instruction that previously failed.

6.3 Synchronization deadlocks

The architecture will guarantee forward progress of all the threads that have been created and had their resources allocated.

Therefore, an excessive number of forward data dependences can stall the processor permanently. The situation arises from improperly written code, since it requires unbounded storage, and can be detected by the compiler.

7. SIMULATION ENVIRONMENT AND RESULTS

A simulator was written to execute the MIPS IV ISA with the α-Coral extensions. Two SPEC95 benchmarks, Compress95 and Swim95, and a number of small benchmarks were hand compiled for this study. Due to the novelty of the architecture, creating a compiler for this project is just as important as the architectural simulator. Yet the benchmarks were assembled by hand, since the compiler is still in the early stages (see Section 8). Simulation of this architecture presents additional challenges, since the instruction trace depends on the memory access patterns, which means that the instruction trace cannot be generated first and then fed to a separate memory simulator. To expedite development and benchmark execution, the memory simulator implements a simplified caching scheme and the processor simulator does not yet include instruction fetch buffers.

7.1 Benchmark descriptions

Fibonacci Numbers Computation, Matrix Multiply, Binary Tree Search, and Quicksort are small hand-compiled and hand-optimized benchmarks. Separate versions were written for single thread and multithreaded execution, but both versions perform the same task. Compress95 and Swim95 are benchmarks from the SPEC95 suite. They were compiled with the SGI CC and Fortran77 compilers for the single thread versions, and then the single threaded versions were augmented with threading instructions. The data set sizes and some array ranges were reduced to allow for reasonable simulation time. The code of the original benchmark was used in its entirety.

Table 1. Instruction Counts in the Benchmarks

Benchmark            Version          Static instructions   Dynamic instructions
Fibonacci            Single Thread                     39                  16062
                     Multithreaded                     43                  13737
Matrix Multiply      Single Thread                     34                  41017
Combination          Single Thread                     69                  57067
Binary Tree Search   Single Thread                    195                  44415
Quicksort            Single Thread                    124                  57142
Swim 95              Single Thread                   3339                7420287
Compress 95          Single Thread                   1396                 586545


7.2 Simulation Environment

Two studies were performed with the simulator. The first aims to demonstrate the scalability of the α-Coral architecture and the second to compare various fetch policies. Single thread and multithreaded versions of benchmarks were compared. Table 1 shows the instruction counts for the benchmarks. The percentage speedup was calculated using the following formula:

\[
\%\,\mathit{SpeedUp} = \frac{\mathit{BaseExecutionTime} - \mathit{ExecutionTime}}{\mathit{BaseExecutionTime}} \times 100\%
\]

BaseExecutionTime is the execution time of the single thread version of the benchmark on a base (Low) processor configuration.
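As a worked illustration, the percentage speedup is the reduction in execution time expressed as a fraction of BaseExecutionTime. The function name and the cycle counts in the usage note are hypothetical and are not taken from the measured results.

```python
def percent_speedup(base_execution_time: float, execution_time: float) -> float:
    """Percentage reduction in execution time relative to the base run."""
    if base_execution_time <= 0:
        raise ValueError("base_execution_time must be positive")
    return (base_execution_time - execution_time) / base_execution_time * 100.0
```

With a hypothetical base run of 200 cycles and a multithreaded run of 150 cycles, `percent_speedup(200, 150)` gives 25.0. A run that takes as long as the base yields 0, and a slower run yields a negative value.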

Speedups were measured relative to the execution time of the same benchmark compiled as a single thread and run on a machine with the same total resources. To demonstrate the potential of this architecture, this paper shows that enabling multithreading allows the program to better utilize the given resources. The tests were performed on four different processor configurations. The configurations ranged from a small two-way architecture (Low) to an 8-issue architecture with an abundance of functional units (Excess).

The processor configurations are listed in Table 2. Not all of the parameters were increased two-fold going from the Low configuration to Excess. The major parameters, such as Issue and Commit width, do in fact increase by two from one configuration to the next. In order to maintain realistic bounds, the number of functional units, as well as the number of registers, in the High and Excess configurations are similar. Also, all of the configurations use the same memory parameters. Each memory stage has read and write ports simulated.

7.3 Scalability of α-Coral

As seen from Figure 2, the multithreaded versions of benchmarks increase in performance with improvements in hardware configuration. At the same time, the single threaded version performance increases slightly or not at all. Matrix Multiply and Swim respond better to increased resources. This is due to a large number of parallel loops in those benchmarks. The Fibonacci and Compress95 benchmarks are more difficult to parallelize, yet they still respond to an increase in hardware.

Benchmarks involving DOACROSS-type loops are parallelized using the BLOCK and ADDS instructions. This causes the hardware to perform “software pipelining” on the loop, allowing for instructions before the synchronization point to issue in parallel, then serializing the execution of the critical sections.

After the execution of the critical sections, threads release their blocks. That provides hardware with some additional parallelism, but not enough to utilize a large number of resources.
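The DOACROSS execution pattern described above can be mimicked in software with a Python sketch. The mapping is an assumption made for illustration: waiting on a shared counter stands in for BLOCK, incrementing it stands in for ADDS, and the actual semantics of those instructions are defined by the architecture, not by this code. The function name and the per-iteration work are hypothetical.

```python
import threading
from typing import List, Tuple


def run_doacross(n: int) -> Tuple[List[int], List[int]]:
    """Run n loop iterations as threads with an ordered critical section."""
    shared_reg = 0                       # stands in for a shared register
    cond = threading.Condition()
    independent = [0] * n                # results of the parallel part
    critical_order: List[int] = []       # order in which critical sections ran

    def iteration(i: int) -> None:
        nonlocal shared_reg
        independent[i] = i * i           # work before the synchronization point
        with cond:
            cond.wait_for(lambda: shared_reg == i)  # assumed BLOCK: wait for turn
            critical_order.append(i)                # critical section, serialized
            shared_reg += 1                         # assumed ADDS: release next
            cond.notify_all()

    threads = [threading.Thread(target=iteration, args=(i,)) for i in range(n)]
    for t in reversed(threads):          # start order does not affect the result
        t.start()
    for t in threads:
        t.join()
    return critical_order, independent
```

The independent part of every iteration may overlap with other iterations, while the critical sections always run in iteration order, which is why the available parallelism is bounded by the length of the critical section.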

In Compress95, some of the loops were executed in parallel with straight-line code in the vicinity of these loops. This accounts for some of the speedup.

Most benchmarks demonstrate diminishing returns in the Excess configuration. This is clearly seen from Figures 2(a) and 2(e). In addition, all of the benchmarks demonstrate a performance increase from Low to Mid configurations.

This study concentrated more on the scalability of multithreaded code and less on the speedup of a multithreaded version of a benchmark over a single thread version. Current results demonstrate that a multithreaded architecture allows for scalability due to greater exposure of ILP, even in a non-multiprogramming workload.

Values for each configuration are instructions per cycle, quantity, or size.

Parameter                        Low      Medium    High      Excess    Delay/Latency
Fetch/Decode/Issue Width         2        4         6         8
Commit Width                     2        4         6         8
Units:
- ALU w/shifter                  1        1         2         2         1 cycle
- ALU w/comparator               1        2         4         4         1 cycle
- Integer multiplier/divider     1        2         2         3         3 stages, 1 cycle
- Branch Unit (Int/FP)           2/1      3/1       4/1       4/1       1 cycle
- FPU (add/sub)                  1        2         4         4         2 stages, 2 cycles
- FPU (multiply)                 1        2         2         2         4 stages, 2 cycles
- FPU (divide/sqrt)              1        1         2         2         4 stages, 3 cycles
- Load Unit                      1        3         6         6         1 stage, 1 cycle if hit
- Store Unit                     1        1         2         3         1 stage, 1 cycle
- System Unit                    1        2         4         4         2 cycles
Maximum Threads Created/Cycle    1        2         4         6
Program Counter Queue            16       32        32        32
Register Files (Int/FP)          64/64    128/128   256/128   256/128
Segments                         16       32        32        32
Branch Target Buffer Entries     64       128       256       256

Memory System (all configurations)   Size            Associativity   R/W Ports   Delay/Latency
- L1 I-Cache                         4096 words      4-way           4/2         1 cycle if hit
- L1 D-Cache                         4096 words      2-way           2/2         1 cycle if hit (read/write)
- L2 Cache                           8192 words      4-way           4/2         10 cycles if hit
- Memory                             6000000 words                   2/2         50 cycles if hit

Table 2. Simulation Parameters


7.4 Fetch policy

In order to determine which of the fetch policies described in the earlier section is most beneficial, several benchmarks were run, each representing a particular type of code (Figure 3). The High processor configuration was used. The Size parameter was varied between one, two, four, and six instructions fetched from each thread. The Order parameter was varied between three options: preferring the last successfully fetched thread (Current), preferring the head of the Program Counter Queue (Head), and preferring the next thread after the last successfully fetched thread (Next).

Since α-Coral extracts speedup by exploiting the parallelism between threads, interleaving threads maximally (i.e., fetching one instruction from each thread) should have been the best policy. In addition, using the Next method should have increased the thread mix in the execution engine. However, different types of benchmarks performed best under different fetch policies.
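The interaction of the Size and Order parameters can be sketched in Python. The function `fetch_plan`, its arguments, and the rule that fetching continues round-robin from the preferred thread until the fetch width is exhausted are assumptions made for illustration; blocked or stalled threads are not modelled.

```python
from typing import List, Tuple


def fetch_plan(queue: List[int], last: int, order: str,
               size: int, width: int) -> List[Tuple[int, int]]:
    """Return (thread id, instruction count) pairs for one fetch cycle.

    queue: thread ids in Program Counter Queue order, head first
    last:  queue position of the last successfully fetched thread
    order: "Current", "Head", or "Next"
    size:  maximum instructions fetched from each thread
    width: total fetch bandwidth in instructions per cycle
    """
    if not queue:
        return []
    if order == "Head":
        start = 0
    elif order == "Current":
        start = last % len(queue)
    elif order == "Next":
        start = (last + 1) % len(queue)
    else:
        raise ValueError(f"unknown order: {order}")

    plan: List[Tuple[int, int]] = []
    remaining = width
    for k in range(len(queue)):
        if remaining == 0:
            break
        take = min(size, remaining)
        plan.append((queue[(start + k) % len(queue)], take))
        remaining -= take
    return plan
```

With three threads and a fetch width of six, Size one touches every thread once and leaves bandwidth unused, while Size six spends the whole cycle on a single thread. This is the trade-off between thread mix and fetch utilization that the experiments explore.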

The Combination benchmark represents a workload with a mix of flow-dependent and fully parallel loops. This benchmark performed best when one instruction was fetched from each thread. In addition, the Next fetching policy seems to bring the best performance. In contrast, for the Compress95 benchmark, fetching up to the maximum bandwidth (six) worked best. This is due to the fact that Compress95 usually contains only one or two threads. During execution, if one of the threads reaches a blocking state before another, then fetching from a blocking thread negatively impacts the overall performance. The Swim benchmark represents a highly parallel class of programs. The greatest benefit came from the Next/one and Next/two fetching policies. For Swim, all of the policies worked well, except for six-instruction fetch. Two-instruction fetch using the Next method led to the best average performance among all of the benchmarks.

8. COMPILER IMPLEMENTATION

Past compiler research concentrates on two areas, Instruction Level Parallelism (ILP) and parallel programming for multiprocessors. Even though compilation for Hardware Multithreading (HMT) involves the same basic principle as compilation for multiprocessors, there are several important differences that provide compilers with more flexibility. This flexibility stems from the choice of creating a very tight ILP schedule or creating small threads.

The HMT architecture provides several more advantages for the parallelizing compiler. Load balancing requirements are relaxed. This allows threads to vary in length and creation time. The load is balanced by the hardware at runtime. Generating threads is a simpler task, since parallelism of various sizes discovered during the high-level examination of the program can be marked and later scheduled as threads. The interference between these threads does not need to be considered during register allocation, since each thread has a private register file. Synchronization introduces less overhead, since it can be done through shared registers.

The PROMIS compiler [8] uses a top-down traversal of a hierarchical task graph [3] to isolate threads. This allows for thread generation from large sections of code, as well as loops and basic blocks. Synchronization variables and loop reduction variables are mapped onto α-Coral shared registers by the register allocator. Thread generation and synchronization instructions are initially encoded as function calls. The assembly code generation phase of PROMIS is driven by the α-Coral Universal Machine Descriptor [4]. This phase is able to emit α-Coral specific instructions in place of these function calls.

9. CONCLUSION

The paper introduced α-Coral, a compiler-driven, hardware multithreaded architecture. The architecture is based on a conventional superscalar design and is similar to the SMT design [11]. The instruction set was augmented with thread management instructions. α-Coral explores several directions of hardware multithreading not included in SMT that can improve performance. The instruction set modifications resulted in a more flexible, easier to target, low-level multithreaded implementation. With recompilation, a single application will potentially execute faster. Even without recompilation, a legacy application may execute faster, since it will use all of the newly available resources.

The results of running several benchmarks demonstrate the advantages of hardware multithreading in discovering parallelism and utilizing a potential abundance of functional units. This was demonstrated by the performance scalability of α-Coral with the increase in the number of resources. A variety of fetch policies were investigated to demonstrate the behavior of different codes in α-Coral.

Future improvements in PROMIS and the α-Coral simulator will allow for further research of hardware multithreading-specific compiler optimizations and architectural design.

10. REFERENCES

[1] Agarwal, A. et al. The MIT Alewife machine: A large-scale distributed memory multiprocessor. In Proceedings of the 1st Workshop on Scalable Shared Memory Multiprocessors, 1991, pp. 239-262.

[2] Alverson, R., Callahan, D., Cummings, D., Koblenz, B., Porterfield, A., and Smith, B. The Tera computer system. In International Conference on Supercomputing, June 1990, pp. 1-6.

[3] Girkar, M. and Polychronopoulos, C. D. The Hierarchical Task Graph as a Universal Intermediate Representation. International Journal of Parallel Programming, 22(5):519-551, 1994.

[4] Ko, W. The Promis Universal Machine Descriptor: Concepts, Design, and Implementation. M.S. thesis, University of Illinois, Urbana, Illinois, 2001.

[5] Kuehn, J. T. and Smith, B. J. The Horizon supercomputing system: Architecture and software. In Proceedings of Supercomputing, 1988, pp. 28-34.

[6] Lo, J., Eggers, S., Levy, H., Parekh, S., and Tullsen, D. Tuning Compiler Optimizations for Simultaneous Multithreading. In 30th Annual International Symposium on Microarchitecture (Micro-30), Dec. 1-3, 1997, pp. 114-124.


[7] Lo, J., Parekh, S., Eggers, S., Levy, H., Tullsen, D., Software-Directed Register Deallocation for Simultaneous Multithreaded Processors, University of Washington Technical Report #UW-CSE-97-12-01, December 1997.

[8] Saito, H., Stavrakos, N., Carroll, S., Polychronopoulos, C., and Nicolau, A. The design of the PROMIS compiler. In Proceedings of the International Conference on Compiler Construction (CC), March 1999.

[9] Smith, B. J. Architecture and applications of the HEP multiprocessor computer system. In SPIE Real Time Signal Processing IV, 1981, pp. 241-248.

[10] Tsien, B. The α-Coral Register Renaming Algorithm. M.S. thesis, University of Illinois, Urbana, Illinois, 1998.

[11] Tullsen, D., Eggers, S., and Levy, H. Simultaneous Multithreading: Maximizing On-Chip Parallelism. In Proceedings of the 22nd Annual International Symposium on Computer Architecture, Santa Margherita Ligure, Italy, June 1995, pp. 392-403.

[12] Tullsen, D., Lo, J., Eggers, S., and Levy, H. Supporting Fine-Grain Synchronization on a Simultaneous Multithreaded Processor. In 5th International Symposium on High Performance Computer Architecture, January 1999.

[13] Valluri, M. G. and Govindarajan, R. Evaluating Register Allocation and Instruction Scheduling Techniques in Out-Of-Order Issue Processors (ICS’99).

[14] Wallace, S., Calder, B., and Tullsen, D., Threaded Multiple Path Execution, Proceedings of the 25th Annual International Symposium on Computer Architecture (ISCA'98), June 29- July 1, 1998.