Introduction to Programming for the Intel ®
2.4 Memory Access and Speculation
The Itanium architecture provides memory access only through register load and store instructions and special semaphore instructions. The architecture also provides extensive support for hiding memory latency via programmer-controlled speculation.
2.4.1 Functionality
Data and instructions are referenced by 64-bit addresses. Instructions are stored in memory in little endian byte order, in which the least significant byte appears in the lowest addressed byte of a memory location. For data, modes for both big and little endian byte order are supported and can be controlled by a bit in the User Mask Register.
Integer loads of one, two, and four bytes are zero-extended, since all 64 bits of each register are always written. Integer stores write one, two, four, or eight bytes of registers to memory as specified.
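The byte-order and zero-extension rules above can be sketched in C. This is a minimal model, not the hardware behavior in full: `model_load` is a hypothetical helper, and its `big_endian` parameter is an assumed stand-in for the byte-order bit in the User Mask Register.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: models an integer load of `size` bytes (1, 2, 4, or 8)
 * starting at `mem`. `big_endian` stands in for the byte-order bit in the
 * User Mask Register. The result is zero-extended to 64 bits. */
uint64_t model_load(const uint8_t *mem, size_t size, int big_endian)
{
    uint64_t value = 0;
    for (size_t i = 0; i < size; i++) {
        /* Little endian: the lowest address holds the least significant byte.
         * Big endian: the lowest address holds the most significant byte. */
        size_t shift = big_endian ? (size - 1 - i) * 8 : i * 8;
        value |= (uint64_t)mem[i] << shift;
    }
    return value; /* bits above the loaded bytes remain 0 (zero extension) */
}
```

For example, with the bytes 0x80, 0x01 at the lowest two addresses, a two-byte load in this model yields 0x0180 in little endian mode and 0x8001 in big endian mode; in both cases the upper 48 bits of the result are zero.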
2.4.2 Speculation
Speculation allows a programmer to break data or control dependencies that would normally limit code motion. The two kinds of speculation are called control speculation and data speculation. This section summarizes speculation in the Itanium architecture. See Chapter 3, “Memory Reference” for more detailed descriptions of speculative instruction behavior and application.
2.4.3 Control Speculation
Control speculation allows loads and their dependent uses to be safely moved above branches. Support for this is enabled by special NaT bits that are attached to integer registers and by special NatVal values for floating-point registers. When a speculative load causes an exception, the exception is not raised immediately. Instead, the NaT bit is set on the destination register (or NatVal is written into the floating-point register). Subsequent speculative instructions that use a register with a set NaT bit propagate the setting until a non-speculative instruction checks for or raises the deferred exception.
For example, in the absence of other information, the compiler for a typical RISC architecture cannot safely move the load above the branch in the sequence below:
    (p1) br.cond.dptk L1          // Cycle 0
         ld8 r3=[r5] ;;           // Cycle 1
         shr r7=r3,r87            // Cycle 3
Assuming that the latency of a load is two cycles, the shift right (shr) instruction will stall for one cycle. However, by using the speculative loads and checks provided in the Itanium architecture, two cycles can be saved by rewriting the above code as shown below:
         ld8.s r3=[r5]            // Earlier cycle
         // Other instructions
    (p1) br.cond.dptk L1 ;;       // Cycle 0
         chk.s r3,recovery        // Cycle 1
         shr r7=r3,r87            // Cycle 1
This code assumes r5 is ready when accessed and that there are sufficient instructions to fill the latency between the ld8.s and the chk.s.
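The deferral mechanism can be illustrated with a small C model. It is only a sketch under stated assumptions: the `Reg` type and the functions `spec_load8`, `spec_shr`, and `chk_s` are hypothetical names, a NULL pointer stands in for any faulting address, and the shift is modeled as an unsigned shift with a masked count to keep the C well defined.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* A general register paired with its NaT bit (hypothetical model type). */
typedef struct {
    uint64_t value;
    bool nat;
} Reg;

/* Models ld8.s. A NULL pointer stands in for any faulting address:
 * instead of raising the exception, the load sets NaT on its target. */
Reg spec_load8(const uint64_t *addr)
{
    Reg r = {0, false};
    if (addr == NULL)
        r.nat = true;      /* exception deferred */
    else
        r.value = *addr;
    return r;
}

/* Models a speculative use of the loaded value (an unsigned shift here).
 * The NaT bit propagates from either source to the result. The count is
 * masked only to keep the C shift well defined (simplification). */
Reg spec_shr(Reg src, Reg count)
{
    Reg r = { src.value >> (count.value & 63), src.nat || count.nat };
    return r;
}

/* Models chk.s: returns true when control should go to recovery code. */
bool chk_s(Reg r)
{
    return r.nat;
}
```

In this model, a successful load followed by `spec_shr` gives a result with a clear NaT bit, so `chk_s` returns false and execution continues. A faulting load leaves NaT set through the shift, and `chk_s` reports that the recovery code must re-execute the load non-speculatively.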
2.4.4 Data Speculation
Data speculation allows loads to be moved above possibly conflicting memory references. The term advanced load refers exclusively to data speculative loads. Review the order of loads and stores in this assembly sequence:
    st8 [r55]=r45                 // Cycle 0
    ld8 r3=[r5] ;;                // Cycle 0
    shr r7=r3,r87                 // Cycle 2
The Itanium architecture allows the programmer to move the load above the store even if it is not known whether the load and the store reference overlapping memory locations. This is accomplished using special advanced load and check instructions:
    ld8.a r3=[r5]                 // Advanced load
    // Other instructions
    st8 [r55]=r45                 // Cycle 0
    ld8.c r3=[r5]                 // Cycle 0 - check
    shr r7=r3,r87                 // Cycle 0
Note: The shr instruction in this schedule could issue in cycle 0 if there were no conflicts between the advanced load and intervening stores. If there were a conflict, the check load instruction (ld8.c) would detect the conflict and reissue the load.
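The interaction between an advanced load, an intervening store, and the check load can be modeled roughly in C. The sketch below assumes a simple per-register tracking table; the names `Cpu`, `AdvEntry`, `ld8_a`, `st8`, and `ld8_c` are hypothetical, and a conflict is detected only on an exact address match, which is a simplification of real overlap detection.

```c
#include <stdbool.h>
#include <stdint.h>

#define NUM_REGS 128

/* One tracking entry per target register: a simplified stand-in for the
 * hardware state that remembers outstanding advanced loads. */
typedef struct {
    bool valid;
    const uint64_t *addr;
} AdvEntry;

typedef struct {
    uint64_t reg[NUM_REGS];
    AdvEntry entry[NUM_REGS];
} Cpu;

/* Models ld8.a: load early and remember the address for register r. */
void ld8_a(Cpu *cpu, int r, const uint64_t *addr)
{
    cpu->reg[r] = *addr;
    cpu->entry[r].valid = true;
    cpu->entry[r].addr = addr;
}

/* Models st8: perform the store and invalidate every conflicting entry. */
void st8(Cpu *cpu, uint64_t *addr, uint64_t value)
{
    *addr = value;
    for (int r = 0; r < NUM_REGS; r++)
        if (cpu->entry[r].valid && cpu->entry[r].addr == addr)
            cpu->entry[r].valid = false;
}

/* Models ld8.c: reload only if the entry for r is no longer valid.
 * Returns true when a reload was needed. */
bool ld8_c(Cpu *cpu, int r, const uint64_t *addr)
{
    if (cpu->entry[r].valid && cpu->entry[r].addr == addr)
        return false;          /* no conflict: keep the early value */
    cpu->reg[r] = *addr;       /* conflict: reissue the load */
    return true;
}
```

When the store targets a different address, `ld8_c` keeps the value loaded early and costs nothing extra in this model. When the store hits the same address, the entry is invalidated and `ld8_c` reloads, so the register ends up with the stored value.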
2.5 Predication
Predication is the conditional execution of an instruction based on a qualifying predicate. A qualifying predicate is a predicate register whose value determines whether the processor commits the results computed by an instruction.
The values of predicate registers are set by the results of instructions such as compare (cmp) and test bit (tbit). When the value of a qualifying predicate associated with an instruction is true (1), the processor executes the instruction, and instruction results are committed. When the value is false (0), the processor discards any results and raises no exceptions. Consider the following C code:
    if (a) {
        b = c + d;
    }
    if (e) {
        h = i - j;
    }
This code can be implemented in the Itanium architecture using qualifying predicates so that branches are removed. The pseudo-code shown below implements the C expressions without branches:
         cmp.ne p1,p2=a,r0        // p1 <- a != 0
         cmp.ne p3,p4=e,r0 ;;     // p3 <- e != 0
    (p1) add b=c,d                // If a != 0 then add
    (p3) sub h=i,j                // If e != 0 then sub
See Chapter 4, “Predication, Control Flow, and Instruction Stream” for detailed discussion of predication. There are a few special cases where predicated instructions read or write architectural resources regardless of their qualifying predicate.
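The effect of the predicated add and sub in the assembly sequence above can be approximated in C by computing both results unconditionally and letting a predicate decide which value is committed. This is a sketch only: the `Results` type and the `predicated` function are hypothetical, and a C compiler is free to compile the conditional expressions with or without branches.

```c
#include <stdint.h>

/* Hypothetical container for the two destination variables. */
typedef struct {
    int64_t b, h;
} Results;

/* Mirrors the predicated add/sub pair: both results are computed, and a
 * predicate decides whether each one replaces the old value. */
Results predicated(int64_t a, int64_t e, int64_t c, int64_t d,
                   int64_t i, int64_t j, Results old)
{
    int p1 = (a != 0);           /* cmp.ne p1,p2=a,r0 */
    int p3 = (e != 0);           /* cmp.ne p3,p4=e,r0 */
    int64_t sum  = c + d;        /* computed regardless of p1 */
    int64_t diff = i - j;        /* computed regardless of p3 */
    Results r;
    r.b = p1 ? sum  : old.b;     /* (p1) add b=c,d */
    r.h = p3 ? diff : old.h;     /* (p3) sub h=i,j */
    return r;
}
```

Calling `predicated` with a nonzero `a` and a zero `e` updates only `b`; the old value of `h` is returned unchanged, which corresponds to the processor discarding the result of an instruction whose qualifying predicate is false.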
2.6 Architectural Support for Procedure Calls
Calling conventions normally require callee-saved and caller-saved registers, which can incur significant overhead during procedure calls and returns. To address this problem, a subset of the Itanium general registers is organized as a logically infinite set of stack frames that are allocated from a finite pool of physical registers.
2.6.1 Stacked Registers
Registers r0 through r31 are called global or static registers and are not part of the stacked registers. The stacked registers are numbered r32 up to a user-configurable maximum of r127.
A called procedure specifies the size of its new stack frame using the alloc instruction. The procedure can use this instruction to allocate up to 96 registers per frame shared amongst input, output, and local values. When a call is made, the output registers of the calling procedure are overlapped with the input registers of the called procedure, thus allowing parameters to be passed with no register copying or spilling.
The hardware renames physical registers so that the stacked registers are always referenced in a procedure starting at r32.
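The overlap of caller outputs and callee inputs can be pictured with a small C model of register renaming. It is a sketch under assumptions: `RegFile`, `Frame`, `frame_reg`, and `frame_call` are hypothetical names, the pool size is arbitrary, and the call and the callee's alloc are folded into a single step.

```c
#include <stdint.h>

#define PHYS_REGS 96 /* size of the stacked physical pool in this sketch */

typedef struct {
    uint64_t phys[PHYS_REGS];
} RegFile;

/* A frame: `base` is the physical slot that the procedure sees as r32. */
typedef struct {
    int base;
    int in_and_local;  /* number of input plus local registers */
    int out;           /* number of output registers */
} Frame;

/* Maps a virtual register number (32 and up) to its physical slot. */
uint64_t *frame_reg(RegFile *rf, Frame f, int vreg)
{
    return &rf->phys[(f.base + (vreg - 32)) % PHYS_REGS];
}

/* Models a call followed by the callee's alloc: the callee's r32 is the
 * caller's first output register, so no values are copied. */
Frame frame_call(Frame caller, int callee_in_and_local, int callee_out)
{
    Frame callee = { (caller.base + caller.in_and_local) % PHYS_REGS,
                     callee_in_and_local, callee_out };
    return callee;
}
```

If a caller with four input and local registers writes its arguments to r36 and r37, the callee created by `frame_call` reads the same physical slots as r32 and r33. Only the base used for renaming changes across the call.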
2.6.2 Register Stack Engine
Management of the register stack is handled by a hardware mechanism called the Register Stack Engine (RSE). The RSE moves the contents of physical registers between the general register file and memory without explicit program intervention. This provides a programming model that looks like an unlimited physical register stack to compilers; however, saving and restoring of registers by the RSE may be costly, so compilers should still attempt to minimize register usage.
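The idea of a finite physical pool backed by memory can be sketched in C as a circular buffer that spills its oldest entry when full and refills on demand. This is a deliberately coarse model: `Rse`, `rse_push`, and `rse_pop` are hypothetical names, the model moves one register at a time rather than whole frames, and it makes no claim about when real hardware performs its spills and fills.

```c
#include <stdint.h>

#define POOL    8   /* tiny physical pool, chosen to force spills */
#define BACKING 64  /* capacity of the memory backing store in this sketch */

typedef struct {
    uint64_t phys[POOL];        /* circular physical register pool */
    uint64_t backing[BACKING];  /* memory backing store */
    int top;                    /* logical index one past the newest register */
    int spilled;                /* logical registers [0, spilled) are in memory */
} Rse;

/* Adds one register; spills the oldest resident register if the pool is full.
 * Assumes fewer than BACKING registers are ever spilled. */
void rse_push(Rse *s, uint64_t v)
{
    if (s->top - s->spilled == POOL) {
        s->backing[s->spilled] = s->phys[s->spilled % POOL];
        s->spilled++;
    }
    s->phys[s->top % POOL] = v;
    s->top++;
}

/* Removes the newest register; fills it from memory first if it was spilled.
 * Assumes at least one register is present (top > 0). */
uint64_t rse_pop(Rse *s)
{
    if (s->top == s->spilled) {
        s->spilled--;
        s->phys[s->spilled % POOL] = s->backing[s->spilled];
    }
    s->top--;
    return s->phys[s->top % POOL];
}
```

Pushing twelve values into the eight-entry pool spills the four oldest to the backing store, and popping all twelve returns them in reverse order without the caller ever naming a memory address. Each spill and fill is extra memory traffic, which is why keeping frames small still pays off.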
2.7 Branches and Hints
Since branches have a major impact on program performance, the Itanium architecture includes features to improve their performance by:
• Using predication to reduce the number of branches in the code. This improves instruction fetching because there are fewer control flow changes, decreases the number of branch mispredicts because there are fewer branches, and increases the branch prediction hit rates because there is less competition for prediction resources.
• Providing software hints for branches to improve hardware use of prediction and prefetching resources.
• Supplying explicit support for software pipelining of loops and exit prediction of counted loops.
2.7.1 Branch Instructions
Branching in the Itanium architecture is largely expressed the same way as on other microprocessors. The major difference is that branch triggers are controlled by predicates rather than conditions encoded in branch instructions. The architecture also provides a rich set of hints to control branch prediction strategy, prefetching, and specific branch types like loops, exits, and branches associated with software pipelining. Targets for indirect branches are placed in branch registers prior to branch instructions.
2.7.2 Loops and Software Pipelining
Compilers sometimes try to improve the performance of loops by using unrolling. However, unrolling is not effective on all loops for the following reasons:
• Unrolling may not fully exploit the parallelism available.
• Unrolling is tailored for a statically defined number of loop iterations.
• Unrolling can increase code size.
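A hand-unrolled loop in C makes these costs concrete. The sketch below is an illustrative example, not taken from the manual: `sum_unrolled4` is a hypothetical function that unrolls a summation loop by a fixed factor of four.

```c
#include <stddef.h>
#include <stdint.h>

/* Sums n elements with the loop body unrolled four times. */
int64_t sum_unrolled4(const int32_t *x, size_t n)
{
    int64_t s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {   /* four independent accumulators */
        s0 += x[i];
        s1 += x[i + 1];
        s2 += x[i + 2];
        s3 += x[i + 3];
    }
    int64_t s = s0 + s1 + s2 + s3;
    for (; i < n; i++)             /* cleanup loop when n is not a multiple of 4 */
        s += x[i];
    return s;
}
```

The unroll factor is fixed when the code is written, a separate cleanup loop is needed for trip counts that are not a multiple of four, and the loop body is roughly four times larger than the original.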
To maintain the advantages of loop unrolling while overcoming these limitations, the Itanium architecture provides architectural support for software pipelining. Software pipelining enables the compiler to interleave the execution of several loop iterations without having to unroll a loop. Software pipelining is performed using:
• Loop-branch instructions.
• LC and EC application registers.
• Rotating registers and loop stage predicates.
• Branch hints that can assign a special prediction mechanism to important branches.
In addition to software pipelined while and counted loops, the architecture provides particular support for simple counted loops using the br.cloop instruction. The cloop branch instruction uses the 64-bit Loop Count (LC) application register rather than a qualifying predicate to determine the branch exit condition.
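The branch decision of a counted loop can be modeled in a few lines of C. The sketch assumes the usual br.cloop behavior: when LC is nonzero it is decremented and the branch is taken, and when LC is zero the loop falls through. The names `br_cloop` and `sum_counted` are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Models the decision made by br.cloop: if LC is nonzero, decrement it and
 * take the branch; if LC is zero, fall through and leave LC unchanged. */
bool br_cloop(uint64_t *lc)
{
    if (*lc != 0) {
        (*lc)--;
        return true;
    }
    return false;
}

/* Sums x[0..n-1] for n >= 1. The branch sits at the bottom of the loop,
 * so LC is initialized to n - 1 to get exactly n iterations. */
uint64_t sum_counted(const uint64_t *x, uint64_t n)
{
    uint64_t lc = n - 1;
    uint64_t sum = 0, i = 0;
    do {
        sum += x[i++];
    } while (br_cloop(&lc));
    return sum;
}
```

Because the loop exit depends only on LC, no compare instruction or predicate is needed inside the loop body in this model.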
For a complete discussion of software pipelining support, see Chapter 5, “Software Pipelining and Loop Support.”
2.7.3 Rotating Registers
Rotating registers enable succinct implementation of software pipelining with predication. Rotating registers are rotated by one register position each time one of the special loop branches is executed. Thus, after one rotation, the content of register X will be found in register X+1 and the value of the highest numbered rotating register will be found in r32. The size of the rotating region of general registers can be any multiple of 8 and is selected by a field in the alloc instruction. The predicate and floating-point registers can also be rotated but the number of rotating registers is not programmable: predicate registers p16 through p63 are rotated, and floating-point registers f32 through f127 are rotated.
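Rotation of the general registers can be sketched in C as a change of renaming base rather than a copy of register contents. The model below assumes a rotating region of eight registers (r32 through r39); `RotRegs`, `rot_reg`, and `rotate` are hypothetical names, and `rrb` is an assumed name for the renaming base.

```c
#include <stdint.h>

#define ROT_SIZE 8 /* rotating region r32..r39; any multiple of 8 would do */

typedef struct {
    uint64_t phys[ROT_SIZE];
    int rrb; /* renaming base in [0, ROT_SIZE), stepped on each rotation */
} RotRegs;

/* Maps rotating register r32+k (0 <= k < ROT_SIZE) to a physical slot. */
uint64_t *rot_reg(RotRegs *r, int vreg)
{
    int k = vreg - 32;
    return &r->phys[(k + r->rrb) % ROT_SIZE];
}

/* One rotation: only the base changes; no register contents are copied. */
void rotate(RotRegs *r)
{
    r->rrb = (r->rrb - 1 + ROT_SIZE) % ROT_SIZE;
}
```

After one call to `rotate`, a value written through r33 is read back through r34, and the value that was in r39, the highest register of the region, is read back through r32.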
2.8 Summary
The Itanium architecture provides features that reduce the effects of traditional microarchitectural performance barriers by enabling:
• Improved ILP with a large number of registers and software scheduling of instruction groups and bundles.
• Better branch handling through predication.
• Reduced overhead for procedure calls through the register stack mechanism.
• Streamlined loop handling through hardware support of software pipelined loops.
• Support for hiding memory latency using speculation.