Memory Reference - Intel Itanium Architecture Software Developer's Manual

3.1 Overview

Memory latency is a major factor in determining the performance of integer applications. In order to help reduce the effects of memory latency, the Itanium architecture explicitly supports software pipelining, large register files, and compiler-controlled speculation. This chapter discusses features and optimizations related to compiler-controlled speculation. See Chapter 5, “Software Pipelining and Loop Support” for a complete description of how to use software pipelining.

The early sections of this chapter review non-speculative load and store in the Itanium architecture, and general concepts and terminology related to data dependencies. The concept of speculation is then introduced, followed by discussions and examples of how speculation is used. The remainder of this chapter describes several important optimizations related to memory access and instruction scheduling.

3.2 Non-speculative Memory References

The Itanium architecture supports non-speculative loads and stores, as well as explicit memory hint instructions.

3.2.1 Stores to Memory

Itanium integer store instructions can write 1, 2, 4, or 8 bytes; floating-point stores can write 4, 8, or 10 bytes. For example, a st4 instruction writes the low-order four bytes of a register to memory.

Although the Itanium architecture uses a little endian memory byte order by default, software can change the byte order by setting the big endian (be) bit of the user mask (UM).
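As a rough illustration of how access size and byte order interact, the C sketch below models the memory image left by a 4-byte store. The name model_st4 and the big_endian flag are hypothetical stand-ins for the st4 instruction and the be bit of the user mask; the sketch describes the result in memory, not how any implementation performs the store.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of st4: write the low-order four bytes of reg to mem.
 * big_endian stands in for the UM be bit (0 = little endian, the default). */
void model_st4(uint8_t mem[4], uint64_t reg, int big_endian)
{
    assert(mem != NULL);
    for (int i = 0; i < 4; i++) {
        /* Byte i, counting from the least significant byte of the register. */
        uint8_t byte = (uint8_t)(reg >> (8 * i));
        /* Little endian puts byte 0 at the lowest address; big endian reverses it. */
        mem[big_endian ? 3 - i : i] = byte;
    }
}
```

With the register value 0x1122334455667788, only the bytes 0x55, 0x66, 0x77, and 0x88 reach memory; the byte order setting decides which of them lands at the lowest address.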

3.2.2 Loads from Memory

Itanium integer load instructions can read 1, 2, 4, or 8 bytes from memory, depending on the type of load issued. Loads of 1, 2, or 4 bytes of data are zero-extended to 64 bits prior to being written into their target registers.

Although loads are provided for various data types, the basic data type is the quadword (8 bytes). Apart from a few exceptions, all integer operations are on quadword data. This can be particularly important when dealing with signed integers and 32-bit addresses, or any addresses that are shorter than 64 bits.
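The consequence for signed 32-bit data can be sketched in C. The functions below are hypothetical models, not architectural definitions: model_ld4 mimics a 4-byte load that zero-extends, and model_sxt4 mimics an explicit sign-extension step (assumed here to correspond to the sxt4 instruction) that software needs before treating the value as a signed quadword.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical model of ld4: read four bytes and zero-extend to 64 bits. */
uint64_t model_ld4(const void *addr)
{
    uint32_t raw;
    assert(addr != NULL);
    memcpy(&raw, addr, sizeof raw);   /* memcpy sidesteps alignment and aliasing issues */
    return (uint64_t)raw;             /* upper 32 bits are zero */
}

/* Hypothetical model of sign extension from 32 to 64 bits. */
int64_t model_sxt4(uint64_t reg)
{
    uint64_t low = reg & 0xFFFFFFFFu;
    /* Flip the sign bit, then subtract its weight: portable two's complement extension. */
    return (int64_t)(low ^ 0x80000000u) - (int64_t)0x80000000u;
}
```

Loading a 32-bit -1 yields 0x00000000FFFFFFFF in the model register, which is a large positive quadword until it is sign-extended.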

3.2.3 Data Prefetch Hint

The lfetch instruction requests that lines be moved between different levels of the memory hierarchy. Like all hint instructions defined in the Itanium architecture, lfetch has no effect on program correctness, and any microarchitecture implementation may choose to ignore it.
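At the source level, a prefetch hint is usually requested through a compiler extension rather than written by hand. The sketch below uses __builtin_prefetch, a GCC and Clang builtin; whether a given compiler lowers it to lfetch on an Itanium target is an assumption, and PREFETCH_DISTANCE is a hypothetical tuning value. Because the hint has no effect on correctness, the function returns the same sum whether or not the prefetch is honored or even compiled in.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed tuning parameter: how many elements ahead to request. */
#define PREFETCH_DISTANCE 16

long sum_with_prefetch(const long *a, size_t n)
{
    long total = 0;
    assert(a != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
#if defined(__GNUC__) || defined(__clang__)
        /* Hint only: the implementation is free to ignore it. */
        if (i + PREFETCH_DISTANCE < n)
            __builtin_prefetch(&a[i + PREFETCH_DISTANCE]);
#endif
        total += a[i];
    }
    return total;
}
```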

3.3 Instruction Dependencies

Data and control dependencies are fundamental factors in optimization and instruction scheduling. Such dependencies can prevent a compiler from scheduling instructions in an order that would yield shorter critical paths and better resource usage since they restrict the placement of instructions relative to other instructions on which they are dependent.

In general, memory references are the major source of control and data dependencies that cannot be broken: breaking a data dependency can produce a wrong answer, and breaking a control dependency can raise a fault that should not be raised. This section provides:

• Background material on memory reference dependencies.

• Descriptions of how dependencies constrain code scheduling on traditional architectures.

Section 3.4 describes memory reference features defined in the Itanium architecture that increase the number of dependencies that can be removed by a compiler.

3.3.1 Control Dependencies

An instruction is control dependent on a branch if the direction taken by the branch affects whether the instruction is executed. In the code below, the load instruction is control dependent on the branch:

(p1) br.cond some_label
ld8 r4=[r5]

The following sections provide overviews of control dependencies and their effects on optimization.

3.3.1.1 Instruction Scheduling and Control Dependencies

The code below contains a control dependency at the branch instruction:

add r7=r6,1 // Cycle 0
add r13=r25,r27
cmp.eq p1,p2=r12,r23
(p1) br.cond some_label ;;
ld4 r2=[r3];; // Cycle 1
sub r4=r2,r11 // Cycle 3

A compiler cannot safely move the load instruction before the branch unless it can guarantee that the moved load will not cause a fatal program fault or otherwise corrupt program state. Since the load cannot be moved upward, the schedule cannot be improved using normal code motion.

Thus, the branch creates a barrier to instructions whose execution depends upon it. In Figure 3-1, the load in block B cannot be moved up because of a conditional branch at the end of block A.
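The hazard can be sketched in C with a bounds check standing in for the branch. The function guarded_load and its parameters are hypothetical; the point is only that the load is safe because it sits below the test, and that hoisting it above the test would let it touch memory the program never intended to read.

```c
#include <assert.h>
#include <stddef.h>

/* The load executes only when the guard allows it. */
long guarded_load(const long *table, size_t n, size_t index, long fallback)
{
    assert(table != NULL || n == 0);
    if (index >= n)          /* plays the role of the conditional branch */
        return fallback;
    return table[index];     /* control dependent on the branch above */
}
```

If the load of table[index] were performed before the comparison, an out-of-range index could fault even though the original program would have taken the fallback path.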

3.3.2 Data Dependencies

A data dependency exists between an instruction that accesses a register or memory location and another instruction that alters the same register or location.

3.3.2.1 Basics of Data Dependency

The following basic terms describe data dependencies between instructions:

• Write-after-write (WAW)

A dependency between two instructions that write to the same register or memory location.

• Write-after-read (WAR)

A dependency between two instructions in which an instruction reads a register or memory location that a subsequent instruction writes.

• Read-after-write (RAW)

A dependency between two instructions in which an instruction writes to a register or memory location that is read by a subsequent instruction.

• Ambiguous memory dependencies

Dependencies between a load and a store, or between two stores, where it cannot be determined whether the instructions involved access overlapping memory locations. Ambiguous memory references include possible WAW, WAR, or RAW dependencies.

• Independent memory references

References by two or more memory instructions that are known not to have conflicting memory accesses.
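The terms above can be summarized as a small classification routine. This is a sketch under a simplifying assumption: each access names exactly one location, and the two locations can be compared. The types access and dep_kind and the function classify_dependency are hypothetical. An ambiguous memory dependency is precisely the case where a compiler cannot evaluate the location comparison and must assume the worst.

```c
#include <assert.h>

typedef enum { DEP_NONE, DEP_RAW, DEP_WAR, DEP_WAW } dep_kind;

/* Hypothetical access record: one location id plus whether it is written. */
typedef struct {
    int location;   /* register number or abstract memory location id */
    int is_write;   /* nonzero for a write, zero for a read */
} access;

/* Classify the dependency from an earlier access to a later one. */
dep_kind classify_dependency(access earlier, access later)
{
    assert(earlier.location >= 0 && later.location >= 0);
    if (earlier.location != later.location)
        return DEP_NONE;    /* independent references */
    if (earlier.is_write && later.is_write)
        return DEP_WAW;     /* both write the same location */
    if (earlier.is_write)
        return DEP_RAW;     /* later reads what earlier wrote */
    if (later.is_write)
        return DEP_WAR;     /* later overwrites what earlier read */
    return DEP_NONE;        /* two reads never conflict */
}
```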

[Figure 3-1. Control Dependency Preventing Code Motion: block A, ending in a branch (br), followed by block B]

3.3.2.2 Data Dependency in the Intel® Itanium® Architecture

The Itanium architecture requires the programmer to insert stops between RAW and WAW register dependencies to ensure correct code results. For example, in the code below, the add instruction computes a value in r4 needed by the sub instruction:

add r4=r5,r6 ;; // Instruction group 1
sub r7=r4,r9 // Instruction group 2

The stop after the add instruction terminates one instruction group so that the sub instruction can legally read r4.

On the other hand, implementations based on the Itanium architecture are required to observe memory-based dependencies within an instruction group. In a single instruction group, a program can contain memory-based data dependent instructions and hardware will produce the same results as if the instructions were executed sequentially and in program order. The pseudo-code below demonstrates a memory dependency that will be observed by hardware:

mov r16=1
mov r17=2 ;;
st8 [r15]=r16
st8 [r14]=r17;;

If the address in r14 is equal to the address in r15, uni-processor hardware guarantees that the memory location will contain the value in r17 (2). The following RAW dependency is also legal in the same instruction group even if software is unable to determine if r1 and r2 overlap:

st8 [r1]=x

ld4 y=[r2]
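The guarantee amounts to ordinary sequential semantics for memory accesses, which a C model can express directly. The function names are hypothetical, the parameters are named after the registers that hold the addresses, and the second function uses an 8-byte load rather than ld4 purely to keep the sketch simple.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Models st8 [r15]=r16 followed by st8 [r14]=r17, with r16=1 and r17=2. */
uint64_t two_stores(uint64_t *r15, uint64_t *r14)
{
    assert(r15 != NULL && r14 != NULL);
    *r15 = 1;
    *r14 = 2;
    return *r15;    /* 2 if the addresses are equal, otherwise 1 */
}

/* Models a store followed by a load in the same instruction group. */
uint64_t store_then_load(uint64_t *r1, const uint64_t *r2, uint64_t x)
{
    assert(r1 != NULL && r2 != NULL);
    *r1 = x;
    return *r2;     /* sees x when the addresses overlap */
}
```

In both functions the outcome depends on whether the two addresses coincide, yet the code is correct either way because the accesses take effect in program order.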

3.3.2.3 Instruction Scheduling and Data Dependencies

The dependency rules are sufficient to generate correct code, but to generate efficient code, the compiler must take into account the latencies of instructions. For example, the generic implementation has a two-cycle latency to the first level data cache. In the code below, the stop maintains correct ordering, but a use of r2 is scheduled only one cycle after its load:

add r7=r6,1 // Cycle 0
add r13=r25,r27
cmp.eq p1,p2=r12,r23;;
add r11=r13,r29 // Cycle 1
ld4 r2=[r3];;
sub r4=r2,r11 // Cycle 3

Since the latency of a load is two cycles, the sub instruction will stall until cycle three. To avoid a stall, the compiler can move the load earlier in the schedule so that the machine can perform useful work each cycle:

ld4 r2=[r3] // Cycle 0
add r7=r6,1
add r13=r25,r27
cmp.eq p1,p2=r12,r23;;
add r11=r13,r29;; // Cycle 1
sub r4=r2,r11 // Cycle 2

In this code, there are enough independent instructions to move the load earlier in the schedule to make better use of the functional units and reduce execution time by one cycle.
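The cycle counts in the two schedules follow from a simple rule: a consumer issues no earlier than the cycle in which its operand becomes available. The helper below is a hypothetical sketch of that rule for a single producer and consumer, assuming the two-cycle load latency stated above; it is not a model of any particular pipeline.

```c
#include <assert.h>

/* Cycle at which a consumer issues: its scheduled slot, or later if the
 * producer's result is not yet available. */
int consumer_issue_cycle(int producer_cycle, int latency, int scheduled_slot)
{
    int ready;
    assert(latency >= 0);
    ready = producer_cycle + latency;
    return ready > scheduled_slot ? ready : scheduled_slot;
}
```

For the first schedule, the load issues in cycle 1 and the sub follows the next stop, so consumer_issue_cycle(1, 2, 2) gives cycle 3. With the load moved to cycle 0, consumer_issue_cycle(0, 2, 2) gives cycle 2, which is the one-cycle saving described above.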

Now suppose that the original code sequence contained an ambiguous memory dependency between a store instruction and the load instruction:

add r7=r6,1 // Cycle 0
add r13=r25,r27
cmp.ne p1,p2=r12,r23;;
st4 [r29]=r13 // Cycle 1
ld4 r2=[r3];;
sub r4=r2,r11 // Cycle 3

In this case, the load cannot be moved past the store due to the memory dependency. Stores will cause data dependencies if they cannot be disambiguated from loads or other stores.

In the absence of other architectural support, stores can prevent moving loads and their dependent instructions. For example, the following C language statements cannot be reordered unless ptr1 and ptr2 are statically known to point to independent memory locations:

*ptr1 = 6;
x = *ptr2;
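A short C sketch makes the restriction concrete. The two hypothetical functions below perform the same store and load in the two possible orders; they agree when the pointers refer to different objects and disagree when the pointers alias.

```c
#include <assert.h>
#include <stddef.h>

/* Program order: store first, then load. */
int store_then_read(int *ptr1, const int *ptr2)
{
    assert(ptr1 != NULL && ptr2 != NULL);
    *ptr1 = 6;
    return *ptr2;
}

/* Hypothetical reordering: the load is hoisted above the store. */
int read_then_store(int *ptr1, const int *ptr2)
{
    int x;
    assert(ptr1 != NULL && ptr2 != NULL);
    x = *ptr2;      /* reads the old value if ptr2 aliases ptr1 */
    *ptr1 = 6;
    return x;
}
```

When both arguments point to the same int holding 1, store_then_read returns 6 while read_then_store returns 1, so a compiler that cannot rule out aliasing must keep the original order.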

3.4 Using Speculation in the Intel® Itanium®
