The State of Affairs - Optimal Global Instruction Scheduling for the Itanium® Processor Architecture

Since its introduction in 2001, the Itanium Processor Family has established itself as one of the leading processor architectures with respect to performance. However, the fundamental superiority of “Intel’s huge bet” [ML02] over conventional RISC architectures remains partly unproven.

A recent study [Alp03, MK02] compared various characteristics of the code produced by Intel’s C++ compiler 6.0 for the Itanium 2 with that of a classic RISC, the Alpha 21264 (using the Compaq compilers). The basis was the SPEC CPU2000 benchmark [SPE00]. On this benchmark, the Itanium 2 and the Alpha 21264 achieve SPECint/SPECfp base rates of 683/1396 and 621/776, respectively (both at 1 GHz, with the compilers mentioned above) [SPE00]. In the following, all numbers reported from the study refer to the dynamic traces of instructions executed during the benchmark runs. Profiling was used on the Itanium 2, but not on the Alpha.

The traces show that the total number of executed instructions, including nops, is about 20% greater on the Itanium. Without nops, however, the numbers for both architectures are almost equal. The restrictive bundling scheme is the main cause of the extra nops. Together with the larger instruction encodings of IA-64 (due to the larger number of registers, predication, and template bits), this makes the total size of the fetched instructions about 60% larger than on the Alpha, which inevitably decreases the instruction cache efficiency. Nevertheless, the 16 KB instruction cache is sufficient for SPECint and SPECfp with their relatively high instruction cache locality: the performance loss due to instruction access stalls is just 3% and 1%, respectively. However, for the larger code working sets of server applications, numbers like 31% were reported for the first-generation Itanium [Li01].
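The 60% figure can be roughly cross-checked from the encoding widths. The following is a minimal sketch, not a calculation taken from the study: it assumes the standard IA-64 bundle format (128 bits holding three instruction slots plus the template) and the fixed 32-bit Alpha instruction word, and it takes the 20% instruction-count overhead from the traces as input. The function name is chosen for illustration only.

```python
# Back-of-the-envelope check of the fetched-code-size figure.
IA64_BUNDLE_BITS = 128       # one IA-64 bundle: 3 slots of 41 bits + 5 template bits
IA64_SLOTS_PER_BUNDLE = 3
ALPHA_INSTRUCTION_BITS = 32  # fixed-width Alpha instruction word


def fetched_size_ratio(instruction_count_ratio: float) -> float:
    """Ratio of fetched code bits, Itanium vs. Alpha.

    instruction_count_ratio: executed IA-64 slots (including nops)
    divided by executed Alpha instructions.
    """
    bits_per_slot = IA64_BUNDLE_BITS / IA64_SLOTS_PER_BUNDLE
    return instruction_count_ratio * bits_per_slot / ALPHA_INSTRUCTION_BITS


ratio = fetched_size_ratio(1.20)  # 20% more executed instructions incl. nops
print(f"{ratio:.2f}")             # prints 1.60
```

Under these assumptions, the one-third wider slot and the 20% nop overhead multiply to a factor of 1.6, which is consistent with the reported value.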

An analysis of the instruction mix (without nops) shows that the Itanium needs 40% fewer memory operations and 30% fewer branches than the Alpha, but 10% more ALU operations, shifts, and compares. This indicates that features like the large number of architected registers, the register stack engine (RSE), and predication are successful in reducing the number of “hard”, stall-inducing instructions like loads and branches. The RSE in particular is seen as an “effective performance enhancement” [Alp03], as it manages procedure calls with very low overhead: the time spent on RSE activity is only about 2% of all cycles (1.5-3 cycles per call/return pair on average).

The effectiveness of predication is more debatable: an earlier study conducted with the first-generation Itanium confirms that if-conversion via predication can reduce the number of mispredicted branches by 29%, but it improved performance by only 2%, which lags far behind predictions of more than 30% from earlier research studies [CKGN01]. One reason for this is that these studies assumed fewer and less sophisticated branch execution and prediction resources than what materialized on the first-generation Itanium processor. There, the penalty for mispredicted branches is only 7% on the SPECint benchmark, which bounds the benefit from removing them. However, this number is likely to increase if the pipeline is stretched on future implementations in order to achieve higher frequencies; hence it is too early for a final verdict on predication. Furthermore, predication is also used by software pipelining: while the benefit of the latter transformation is negligible for SPECint (1%), it is dramatic for SPECfp (more than 30%).
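The bound mentioned above can be made concrete with an Amdahl-style estimate. This is a sketch under two assumptions that are not stated in the source: that the 7% penalty is a share of total execution time, and that the penalty shrinks in proportion to the number of mispredictions removed. The function name is hypothetical.

```python
def max_gain_from_removed_mispredictions(penalty_share: float,
                                         removed_fraction: float):
    """Upper bound on the benefit of removing mispredicted branches.

    penalty_share:    assumed share of execution time lost to mispredictions
    removed_fraction: fraction of mispredictions eliminated by if-conversion
    Returns (share of execution time saved, resulting speedup factor).
    """
    saved = penalty_share * removed_fraction
    speedup = 1.0 / (1.0 - saved)
    return saved, speedup


# Numbers quoted for the first-generation Itanium on SPECint.
saved, speedup = max_gain_from_removed_mispredictions(0.07, 0.29)
print(f"time saved: {saved:.1%}, speedup: {speedup:.3f}x")
```

With these inputs the saved share comes out at about 2% of the execution time, in line with the measured improvement, and even removing every misprediction could not recover more than the 7% penalty itself.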

Figure 2.13: Breakdown of the execution time for SPECint 2000 [MK02]. Categories shown: Pipeline Flush, Data Access, Instruction Access, RSE Activity, Scoreboard, Unstalled (chart labels recovered: 38%, 49%).

The study of the SPEC benchmarks on the Itanium 2 also includes an analysis of the compiler’s ability to extract instruction-level parallelism [Alp03, MK02]. The static instructions-per-clock (IPC) rate found for SPECint (without nops and predicated-off instructions) is 2.5 on average, far from the maximum rate of 6 IPC the processor was designed for. At runtime, stalls caused by cache/TLB misses and other hazards almost halve this average rate to 1.3 IPC. In other words, the unstalled execution time is about half of the total execution time. This is depicted in Fig. 2.13, where “Data Access” denotes the stalls due to cache-missing loads and “Scoreboard” all stalls due to other long-latency instructions. The authors also investigated the region sizes in the trace and measured 19 useful instructions per taken branch, 159 useful instructions per mispredicted branch, and 212 useful instructions per call on average.
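The step from the two IPC figures to "about half" follows from simple arithmetic: for a fixed number of useful instructions, cycle counts are inversely proportional to IPC. A minimal sketch using only the numbers quoted above (the variable names are chosen for illustration):

```python
STATIC_IPC = 2.5   # schedule as planned by the compiler, no stalls
RUNTIME_IPC = 1.3  # measured rate including all stalls
PEAK_IPC = 6.0     # maximum issue rate the processor was designed for

# For N useful instructions: unstalled cycles = N / STATIC_IPC and
# total cycles = N / RUNTIME_IPC, so N cancels in the ratio.
unstalled_share = RUNTIME_IPC / STATIC_IPC
static_utilization = STATIC_IPC / PEAK_IPC
runtime_utilization = RUNTIME_IPC / PEAK_IPC

print(f"unstalled share of execution time: {unstalled_share:.0%}")
print(f"issue-slot utilization, static:    {static_utilization:.0%}")
print(f"issue-slot utilization, runtime:   {runtime_utilization:.0%}")
```

The first value, 52%, is the "about half" of the text; the other two show how far both the static schedule and the actual execution remain from the peak issue rate.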

While the study did not analyze the benefit of speculation, it measured that about 24% of loads use control speculation, but only 4.5% use data speculation. The failure rates are very low, at less than 0.001% and 1% for control and data speculation, respectively. This shows that the compiler has the cost of speculation well under control, but it leaves open whether this is merely due to conservative compiler heuristics that restrict the application of this feature.
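To see why such low failure rates hint at conservative heuristics, consider a simple expected-value model of a speculated load. This is an illustrative sketch only: the model, the function names, and both cycle counts (2 cycles saved per successful speculation, 50 cycles of recovery per failure) are hypothetical and not taken from the study; only the failure rates come from the text.

```python
def net_cycles_saved_per_load(failure_rate: float,
                              cycles_saved: float,
                              recovery_cycles: float) -> float:
    """Expected cycles gained per speculated load.

    A successful speculation saves `cycles_saved`; a failed one pays
    `recovery_cycles` instead. Both cycle counts are hypothetical inputs.
    """
    return (1.0 - failure_rate) * cycles_saved - failure_rate * recovery_cycles


def break_even_failure_rate(cycles_saved: float, recovery_cycles: float) -> float:
    """Failure rate at which speculation stops paying off in this model."""
    return cycles_saved / (cycles_saved + recovery_cycles)


CYCLES_SAVED = 2.0      # hypothetical benefit of hoisting a load
RECOVERY_CYCLES = 50.0  # hypothetical cost of running recovery code

control = net_cycles_saved_per_load(0.00001, CYCLES_SAVED, RECOVERY_CYCLES)  # 0.001%
data = net_cycles_saved_per_load(0.01, CYCLES_SAVED, RECOVERY_CYCLES)        # 1%
threshold = break_even_failure_rate(CYCLES_SAVED, RECOVERY_CYCLES)

print(f"control speculation: {control:.3f} cycles gained per load")
print(f"data speculation:    {data:.3f} cycles gained per load")
print(f"break-even failure rate: {threshold:.1%}")
```

In this toy model the measured failure rates lie well below the break-even point, so a more aggressive heuristic could accept noticeably more failures and still come out ahead. Whether that holds on real hardware depends on the actual recovery costs, which the study does not report.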

Overall, these numbers paint a rather positive picture: the Intel compiler can transform the architecture’s new features into performance that is “highly competitive with the best RISC processors” [Alp03]. But the high percentage of nops and the low static IPC indicate that some central visions of the architecture’s inventors have not (yet) come true. Intel’s compiler team has identified the following top-five challenges [Li01]:

1. Managing data caches/DTLB for acyclic code
2. Managing instruction cache/ITLB
3. More effective use of control speculation
4. More effective use of data speculation
5. Creative use of predication

This work tackles all of these items, especially the last three, and it also points out another candidate for the list: more effective global instruction scheduling.

The Global Instruction Scheduling Problem
