Instruction Set Overview - Fundamentals and Basic Notions

2.1.3 Instruction Set Overview

The IA-64 instruction set is large and diversified; only the most common instructions are listed in the following tables with their syntax and semantics. For most instructions, the semantics is described in C-like pseudocode. Technically similar instructions that execute on the same units with the same latency on Itanium processors are arranged in groups.

Table 2.1 lists A-type instructions: integer addition, subtraction, a shift-left-and-add instruction used for address computations, logic operations, and a compare instruction in many variations. All these instructions (except for the logic instructions) also exist in SIMD variants (“multimedia instructions”) that treat the general registers as concatenations of eight 8-bit, four 16-bit, or two 32-bit elements. They perform the operation on each of these elements (denoted by e in the table) independently and in parallel.

with n instructions inherently must contain at least n modulo 3 nops. If the basic block sizes are assumed to be distributed randomly, this results in one additional nop per basic block on average.

Group     Syntax                        Semantics

IALU      add r1=r2,r3                  r1=r2+r3
          add r1=r2,r3,1                r1=r2+r3+1
          add r1=imm,r3                 r1=imm+r3
          sub r1=r2,r3                  r1=r2-r3
          sub r1=r2,r3,1                r1=r2-r3-1
          sub r1=imm,r3                 r1=imm-r3
          shladd r1=r2,imm,r3           r1=(r2<<imm)+r3
ILOG      and r1=r2,r3                  r1=r2&r3
          and r1=imm,r3                 r1=imm&r3
          andcm r1=r2,r3                r1=r2&~r3
          andcm r1=imm,r3               r1=imm&~r3
          or r1=r2,r3                   r1=r2|r3
          or r1=imm,r3                  r1=imm|r3
          xor r1=r2,r3                  r1=r2^r3
          xor r1=imm,r3                 r1=imm^r3
ICMP      cmp.CR.CT p1,p2=r2,r3         p1=(r2 CR r3)
            r2 can also be 'imm'        p2=~(r2 CR r3)
            CR=eq,ne,lt,le,gt,ge,       CR= ==, !=, <, <=, >, >=
               ltu,leu,gtu,geu          u = unsigned
            CT=ε,unc,or,and,or.andcm    See text
MMALU_A   paddX.COMP r1=r2,r3           e1=e2+e3
            X=1,2,4                     SIMD element size
            COMP=ε,sss,uuu,uus          See manual
          psubX.COMP r1=r2,r3           e1=e2-e3
            X=1,2,4                     SIMD element size
            COMP=ε,sss,uuu,uus          See manual
          pavgX.COMP r1=r2,r3           e1=(e2+e3+1)>>1
            X=1,2                       SIMD element size
            COMP=ε,raz                  See manual
          pavgsubX r1=r2,r3             See manual
            X=1,2                       SIMD element size
          pshladd2 r1=r2,c3,r3          e1=(e2<<c3)+e3
          pshradd2 r1=r2,c3,r3          e1=(e2>>c3)+e3
          pcmpX.PR r1=r2,r3             e1=(e2 PR e3)
            X=1,2 PR=eq,gt              See manual

Table 2.1: A-type instructions.
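The element-wise semantics of the multimedia instructions can be illustrated with a small Python model. This is only a sketch: the helper names (split, join, padd, shladd) are chosen here for illustration, registers are modeled as plain integers, and only the wrap-around behavior of paddX with the empty completer is covered (the saturating variants sss, uuu, uus are left to the manual).

```python
MASK64 = (1 << 64) - 1

def split(reg, elem_bytes):
    """Split a 64-bit register value into elements, least significant first."""
    bits = 8 * elem_bytes
    mask = (1 << bits) - 1
    return [(reg >> (i * bits)) & mask for i in range(64 // bits)]

def join(elems, elem_bytes):
    """Concatenate elements (least significant first) into a 64-bit value."""
    bits = 8 * elem_bytes
    mask = (1 << bits) - 1
    reg = 0
    for i, e in enumerate(elems):
        reg |= (e & mask) << (i * bits)
    return reg

def padd(r2, r3, elem_bytes):
    """Model of paddX (empty completer): e1 = e2 + e3 for every element.
    Each sum wraps around inside its element; no carry reaches the neighbor."""
    mask = (1 << (8 * elem_bytes)) - 1
    sums = [(a + b) & mask
            for a, b in zip(split(r2, elem_bytes), split(r3, elem_bytes))]
    return join(sums, elem_bytes)

def shladd(r2, imm, r3):
    """Model of shladd: r1 = (r2 << imm) + r3, truncated to 64 bits."""
    return ((r2 << imm) + r3) & MASK64

# With 1-byte elements the carry out of the lowest byte is discarded;
# with 2-byte elements the same inputs behave like an ordinary addition.
print(hex(padd(0x00FF, 0x0001, 1)), hex(padd(0x00FF, 0x0001, 2)))
# Address computation: base 0x1000 plus index 5 scaled by 8 bytes.
print(hex(shladd(5, 3, 0x1000)))
```

The same register contents thus yield different results depending on the element size, which is the essential difference between the SIMD variants and the plain add.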

In addition, more complex SIMD operations are available as I-type instructions (Tab. 2.3): these include parallel multiply, parallel shift and a combination of both, as well as highly specialized parallel minimum and maximum operations, and pack and unpack instructions (which convert between different element sizes). The I-type instructions also comprise several non-SIMD shift instructions that shift the value of a general register by an amount specified by another general register (variable shift) or an encoded constant (fixed shift). While the variable shifts shr and shl are technically similar to SIMD instructions, the fixed shifts are actually performed by the more general shift-and-mask instructions dep and extr, which move bit fields to different bit positions. Further I-type instructions transfer values between different register files.
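How a fixed shift reduces to a bit-field move can be sketched in Python. The function names below are illustrative, registers are modeled as unsigned 64-bit integers, and only the zero-filling forms dep.z and extr.u are modeled; the signed extr and the merging dep are omitted.

```python
MASK64 = (1 << 64) - 1

def extr_u(r3, pos, length):
    """extr.u: extract `length` bits of r3 starting at bit `pos`, zero-extended."""
    return (r3 >> pos) & ((1 << length) - 1)

def dep_z(r2, pos, length):
    """dep.z: place the low `length` bits of r2 at bit `pos`; other bits are 0."""
    return ((r2 & ((1 << length) - 1)) << pos) & MASK64

def shl_fixed(r2, count):
    """Fixed left shift: deposit the low 64-count bits at position count."""
    return dep_z(r2, count, 64 - count)

def shr_u_fixed(r3, count):
    """Fixed unsigned right shift: extract the upper 64-count bits."""
    return extr_u(r3, count, 64 - count)

x = 0xF0F0123456789ABC
print(hex(shl_fixed(x, 12)), hex(shr_u_fixed(x, 12)))
```

In this model a left shift by count is a deposit of a (64-count)-bit field at position count, and an unsigned right shift by count is an extract of the (64-count)-bit field that starts at position count.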

The M-type instructions (Tab. 2.4) include loads, stores, and the prefetch instruction lfetch. The latter can be employed by the compiler to move the addressed line to a location in the memory hierarchy in order to speed up expected future accesses to this line. The effect of lfetch is comparable to a load without a destination register (apart from other implementation-specific differences). The intended destination location inside the memory hierarchy is specified by a locality hint (given by a completer): for instance, the nt1 completer indicates that the data should not be prefetched into the highest level of the cache hierarchy, but into all lower levels. This can be used to prevent the congestion (“pollution”) of a small L1 cache if large amounts of data are prefetched. Loads and stores support these locality hints, too. The hints do not affect the functional behavior of the program, only its performance, and they do so in an implementation-specific manner.

All memory instructions support post-increment, i.e., an additional source operand that is added to the address register after the memory access. Both immediate and register post-increment are defined for loads and prefetches; stores, however, only allow immediate post-increment (otherwise there would be three source registers).
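The effect of post-increment can be sketched with a toy register and memory model in Python. The dictionary-based memory, the register names, and the helper functions are assumptions made for this illustration only; the point is that the address update is part of the load itself.

```python
def ld8_postinc(mem, regs, dest, addr_reg, inc):
    """Model of `ld8 dest=[addr_reg],inc`: load first, then bump the address."""
    regs[dest] = mem[regs[addr_reg]]
    regs[addr_reg] += inc

def sum_array(mem, base, n):
    """Sum n consecutive 8-byte words starting at address `base`.
    No separate add is needed to advance the pointer r3."""
    regs = {"r1": 0, "r3": base, "r8": 0}
    for _ in range(n):
        ld8_postinc(mem, regs, "r1", "r3", 8)
        regs["r8"] += regs["r1"]
    return regs["r8"], regs["r3"]

memory = {0x1000: 10, 0x1008: 20, 0x1010: 30}
total, next_addr = sum_array(memory, 0x1000, 3)
print(total, hex(next_addr))
```

After the loop the address register points just past the last element, as it would after three loads with an immediate post-increment of 8.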

The getf and setf instructions are used to transfer integers from and to floating-point registers, respectively.

The B-type instructions (Tab. 2.5) comprise IP-relative branches, calls and returns (explained in Sec. 2.1.1.3), and indirect branches, which use branch registers to specify the branch target address. Most of these branches can be made conditional via predication. The compiler can also encode a branch hint, a completer that signals whether the branch should be predicted taken (dptk, sptk) or not-taken (dpnt, spnt). While the dpxx completers only predefine the branch direction for the cases where dynamic branch prediction information is not yet available, the spxx completers indicate that no dynamic prediction resources should be allocated at all for a branch (this can be used to mark branches that are most likely to be not-taken, for example to error handlers).
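The difference between the static and the dynamic hint classes can be summarized in a few lines of Python. This is a deliberately simplified decision model, not a description of any actual predictor: the function name and the representation of the predictor state (True, False, or None for "no history yet") are assumptions of this sketch.

```python
def predict_taken(hint, dynamic_state):
    """Predicted direction of a branch carrying the hint completer `hint`.

    hint:          "sptk", "spnt", "dptk" or "dpnt"
    dynamic_state: True/False if the dynamic predictor has information
                   about this branch, None if it has none yet.
    """
    static_taken = hint.endswith("tk")
    if hint.startswith("sp"):
        # Static hint: no dynamic resources are allocated or consulted.
        return static_taken
    if dynamic_state is None:
        # Dynamic hint without history: fall back to the encoded direction.
        return static_taken
    return dynamic_state

print(predict_taken("spnt", True), predict_taken("dpnt", True))
```

For an error-handler branch marked spnt, the model always predicts not-taken, whereas a dpnt branch follows the dynamic information as soon as it exists.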

These branch hints can also be provided earlier in the code by specific branch predict instructions, along with information about the location and the target address of the upcoming branch. If scheduled several cycles before the actual branch, this information can be used by the processor to prepare the branch execution, for instance by prefetching instructions from the branch target address into the instruction cache.

Group     Syntax                        Semantics

ISHF      dep r1=r2,r3,p,len            Deposits bit fields
          dep r1=imm,r3,p,len           See manual
          dep.z r1=r2,p,len             Variant with r3=0
          dep.z r1=imm,p,len            See manual
          extr r1=r3,p,len              Extracts bit fields
          extr.u r1=r3,p,len            See manual
FRBR      mov r1=b1                     Reads branch registers
TOBR      mov b1=r1                     Writes branch registers
FRAR      mov r1=lc                     Reads the registers
          mov r1=pfs                    lc and pfs
TOAR      mov lc=r1                     Writes the registers
          mov ec=r1                     lc, ec and pfs
          mov pfs=r1                    imm operand also possible
FRPR      mov r1=pr                     Reads predicate registers
TOPR      mov pr.rot=imm                Writes predicate registers
CHK_I     chk.s r2,target               Control speculation check
TBIT      tbit.R.CT p1,p2=r3,p          Tests if bit p in r3
            R=nz,z                      is 1 (nz) or 0 (z)
            CT as with cmp
          tnat.R.CT p1,p2=r3            Tests NaT bit of r3
MMALU_I   pmax1.u r1=r2,r3              e1=max_unsigd(e2,e3)
          pmax2 r1=r2,r3                e1=max(e2,e3)
          pmin1.u r1=r2,r3              e1=min_unsigd(e2,e3)
          pmin2 r1=r2,r3                e1=min(e2,e3)
MMMUL     pmpy2.r r1=r2,r3              Parallel multiply
          pmpy2.l r1=r2,r3              See manual
          pmpyshr2 r1=r2,r3,c2          Parallel multiply and shift
          pmpyshr2.u r1=...             See manual
MMSHF     packX.sss r1=r2,r3            See manual
            X=2,4
          unpackX.C r1=r2,r3            See manual
            X=2,4 C=h,l
          pshrX[.u] r1=r2,r3            e1=(e2>>r3) arithmetic
          pshrX[.u] r1=r2,imm           e1=(e2>>imm) arithmetic
            X=2,4                       u=unsigned
          pshlX r1=r2,r3                e1=(e2<<r3)
          pshlX r1=r2,imm               e1=(e2<<imm)
            X=2,4
          shr r1=r2,r3                  r1=(r2>>r3) arithmetic
          shr.u r1=r2,r3                r1=(r2>>r3) unsigned
          shl r1=r2,r3                  r1=(r2<<r3)
XTD       sxt/zxt/czx r1=r2             Sign extension

Table 2.3: I-type instructions.

Group     Syntax                        Semantics

LD        ldX.LDT r1=[r3]               r1=mem(r3)
          ldX.LDT r1=[r3],r2            Post incr. r3 += r2
          ldX.LDT r1=[r3],imm           Post incr. r3 += imm
            X=1,2,4,8                   Data size
            LDT=ε,s,a,sa,c,             Completers for speculation
                c.clr,fill              See text
FLD       ldffsz.LDT f1=[r3]            f1=mem(r3)
            fsz=s,d,e                   Data size
ST        stX [r3]=r2                   mem(r3)=r2
          stX [r3]=r2,imm               Post incr. r3 += imm
            X=1,2,4,8                   Data size
          st8.spill [r3]=...            See manual
LFETCH    lfetch [r3]                   Prefetch
          lfetch [r3],r2                Post incr. r3 += r2
          lfetch [r3],imm               Post incr. r3 += imm
FRFR      getf.sig r1=f2                r1=f2
TOFR      setf.sig f1=r2                f1=r2
ALLOC     alloc r1=ar.pfs,i,l,o,r       See text

Table 2.4: M-type instructions.

Group     Syntax                        Semantics

BR        br.BT.BW target               Branch
          br.BT.BW b1=target            Call form
          br.BT.BW b2                   Indirect form
            BT=cond,call,ret,           See text
               cloop,ctop,cexit,
               wtop,wexit
            BW=spnt,sptk,               Branch hints
               dpnt,dptk                See manual
RSE_B     clrrrb                        Clear RRB
          clrrrb.pr                     See manual
BRP       brp.ipwh.ih target,tag        Branch predict
            ipwh=sptk,loop,exit,dptk    Branch information
            ih=ε,imp                    See text

Table 2.5: B-type instructions.

The floating-point (F-type) arithmetic instructions (Tab. 2.6) support the internal 82-bit floating-point register format as well as single, double or double-extended real formats according to the IEEE standard [HP03]. The table lists only a small selection of all floating-point instructions; many exist also in SIMD variants that treat the register’s 64-bit significands as a pair of IEEE single precision values. The result range and precision are determined either statically via the instruction’s completer, or dynamically via the precision-control and widest-range-exponent fields in the floating-point status register (FPSR).

In the latter case, each instruction refers to one of the four identical status fields sf0-sf3 inside FPSR, which serves as a kind of execution context, i.e., which controls and records its execution: The status field specifies the output format and contains IEEE flags that are set according to the result (underflow, overflow, etc.). The purpose of multiple status fields is, for example, that some of them can be used to store the status flags of speculated instructions, which are only later committed to architectural state (which is typically in sf0).

Group     Syntax                        Semantics

FMAC      fma.pc.sf f1=f3,f4,f2         f1=f3*f4+f2
          fnma.pc.sf f1=f3,f4,f2        f1=-(f3*f4)+f2
            pc=.s,.d,none               Precision
            sf=sf0,sf1,sf2,sf3          Status field
FMISC     frcpa.sf f1,p2=f2,f3          f1=f2/f3 ∧ p2=0, or
                                        f1=approx(1/f3) ∧ p2=1
          frsqrta.sf f1,p2=f3           f1=sqrt(f3) ∧ p2=0, or
                                        f1=approx(1/sqrt(f3)) ∧ p2=1
          fmax.sf f1=f2,f3              f1=max(f2,f3)
          fmin.sf f1=f2,f3              f1=min(f2,f3)
FCVT      fcvt.xuf f1=f2                Treat 64-bit significand of f2 as an integer, convert to FP
          fcvt.fxu.sf f1=f2             Convert FP to (unsig.) integer
XMA       xma.xs f1=f3,f4,f2            Integer mul-add: f1=f3*f4+f2
            x=l,u                       f1: lower, upper bits of sum
            s=ε,u                       signed, unsigned
FCMP      fcmp.r.t.sf p1,p2=f2,f3       Compare
            r=eq,ne,lt,gt,etc.          like A-type cmp
            t=ε,unc                     like A-type cmp

Table 2.6: F-type instructions.

The basic building block of all floating-point computations is the fused multiply-and-add instruction fma, which computes a multiplication combined with an addition (a frequent combination in linear algebra). The combined execution of these operations is faster and more precise since only one rounding of the result occurs. If only single additions and multiplications are needed, one of the source registers can be replaced by f1 (= 1.0) and f0 (= 0.0), respectively.
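The benefit of the single rounding can be made concrete with a small Python experiment. The fma below is a software stand-in written for this illustration (it evaluates a*b+c exactly with rational arithmetic and rounds once to double precision); it models neither the 82-bit register format nor the status fields.

```python
from fractions import Fraction

def fma(a, b, c):
    """a*b + c evaluated exactly, then rounded once to double precision."""
    return float(Fraction(a) * Fraction(b) + Fraction(c))

def mul_then_add(a, b, c):
    """Separate multiply and add: the product is rounded before the addition."""
    return a * b + c

a = 1.0 + 2.0**-30
b = 1.0 - 2.0**-30
c = -1.0

# The exact product is 1 - 2**-60, which rounds to 1.0 in double precision,
# so the unfused version loses the entire result.
print(mul_then_add(a, b, c))   # 0.0
print(fma(a, b, c))            # -2**-60

# Plain additions and multiplications as special cases (f1 = 1.0, f0 = 0.0):
print(fma(0.1, 1.0, 0.2) == 0.1 + 0.2, fma(0.1, 0.3, 0.0) == 0.1 * 0.3)
```

The last line mirrors the use of f1 and f0 described above: multiplying by 1.0 yields an addition, and adding 0.0 yields a multiplication.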

Many complex operations like divide, remainder, and transcendental functions are not available in hardware, but explicitly computed by sequences of fma instructions [HKST99, CHN99].

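Algorithm 3: Sequence to compute f8=f6/f7 in double precision.

A       frcpa.s0   f8,p6 = f6,f7 ;;      // y0 = approx(1/f7); p6 = 1 if refinement is needed
B (p6)  fma.s1     f9 = f6,f8,f0         // q0 = f6*y0
C (p6)  fnma.s1    f10 = f7,f8,f1 ;;     // e0 = 1 - f7*y0
D (p6)  fma.s1     f9 = f10,f9,f9        // q1 = q0 + e0*q0
E (p6)  fma.s1     f11 = f10,f10,f0      // e1 = e0*e0
F (p6)  fma.s1     f8 = f10,f8,f8 ;;     // y1 = y0 + e0*y0
G (p6)  fma.s1     f9 = f11,f9,f9        // q2 = q1 + e1*q1
H (p6)  fma.s1     f10 = f11,f11,f0      // e2 = e1*e1
I (p6)  fma.s1     f8 = f11,f8,f8 ;;     // y2 = y1 + e1*y1
J (p6)  fma.d.s1   f9 = f10,f9,f9        // q3 = q2 + e2*q2
K (p6)  fma.s1     f8 = f10,f8,f8 ;;     // y3 = y2 + e2*y2
L (p6)  fnma.d.s1  f6 = f7,f9,f6 ;;      // r = f6 - f7*q3
M (p6)  fma.d.s0   f8 = f6,f8,f9         // result = q3 + r*y3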

For example, Alg. 3 shows a sequence from [Int02a] that computes f8=f6/f7 in double precision. The first instruction, frcpa, computes an approximation (good to 8 bits) of 1/f7. The remaining instructions then perform three (unrolled) iterations of the Newton-Raphson method to compute the correctly rounded value of f6/f7 [HKST99, MP00]. In special cases where these iterations are not necessary or not sufficient, the result is provided by other means, either by frcpa itself or by an invoked software handler, and the iterations are predicated off by clearing p6.
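The data flow of Alg. 3 can be followed step by step in a Python model. Several simplifications are assumed here and are not part of the original sequence: frcpa_model is a hypothetical stand-in that rounds 1/b to about 8 bits, every fma rounds to double precision (the real sequence keeps intermediate results in the wider register format under status field s1), and the special cases signalled through p6 are ignored. The model is therefore only expected to come within one unit in the last place of the correctly rounded quotient.

```python
import math
from fractions import Fraction

def fma(a, b, c):
    """a*b + c with a single rounding to double precision (software stand-in)."""
    return float(Fraction(a) * Fraction(b) + Fraction(c))

def frcpa_model(b):
    """Hypothetical stand-in for frcpa: 1/b with a relative error of at most 2**-8."""
    m, e = math.frexp(1.0 / b)            # 1/b = m * 2**e with 0.5 <= |m| < 1
    return math.ldexp(round(m * 256) / 256, e)

def divide(a, b):
    """Steps A to M of Alg. 3 for finite, nonzero operands far from over/underflow."""
    y = frcpa_model(b)       # A: y0 ~ 1/b
    q = fma(a, y, 0.0)       # B: q0 = a*y0
    e0 = fma(-b, y, 1.0)     # C: e0 = 1 - b*y0
    q = fma(e0, q, q)        # D: q1 = q0 + e0*q0
    e1 = fma(e0, e0, 0.0)    # E: e1 = e0^2
    y = fma(e0, y, y)        # F: y1 = y0 + e0*y0
    q = fma(e1, q, q)        # G: q2 = q1 + e1*q1
    e2 = fma(e1, e1, 0.0)    # H: e2 = e1^2
    y = fma(e1, y, y)        # I: y2 = y1 + e1*y1
    q = fma(e2, q, q)        # J: q3 = q2 + e2*q2
    y = fma(e2, y, y)        # K: y3 = y2 + e2*y2
    r = fma(-b, q, a)        # L: r = a - b*q3
    return fma(r, y, q)      # M: q3 + r*y3

for a, b in [(1.0, 3.0), (2.0, 7.0), (355.0, 113.0)]:
    print(a, b, divide(a, b), a / b)
```

Each iteration squares the relative error of the reciprocal (from about 2^-8 to 2^-16, 2^-32 and 2^-64), which is why three iterations suffice for double precision; the final two steps correct the quotient using the exactly computed remainder.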

The routine takes 30 cycles on the Itanium 2 (4 per fma, which can be executed on both available F-units). Apart from the drawback of code expansion, computing complex floating-point operations by sequences of simple atomic multiply-and-adds has several advantages: the hardware is simpler, and the FMA units can be optimized and fully pipelined (in contrast to typical hardware implementations of division, square root, etc. [MP00]), boosting throughput and scalability. It is also possible to schedule several sequences in an interleaved manner in order to exploit the parallelism of the pipelined units.
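The gain from interleaving can be estimated with a toy list scheduler in Python. The model is an assumption of this sketch, not the scheduling method of this work: it knows only the data dependences read off Alg. 3, two fully pipelined F-units, and a uniform latency of 4 cycles for every instruction including frcpa. Under these assumptions a single sequence finishes after 28 cycles, slightly below the 30 cycles quoted above, since all other effects are ignored.

```python
LATENCY = 4   # cycles until a result can be consumed (uniform, assumed)
UNITS = 2     # F-units; each accepts one new instruction per cycle

# Instruction -> instructions whose results it reads (from Alg. 3).
DEPS = {
    "A": [],
    "B": ["A"], "C": ["A"],
    "D": ["B", "C"], "E": ["C"], "F": ["A", "C"],
    "G": ["D", "E"], "H": ["E"], "I": ["E", "F"],
    "J": ["G", "H"], "K": ["H", "I"],
    "L": ["J"],
    "M": ["J", "K", "L"],
}

def makespan(copies):
    """Cycles needed to finish `copies` independent division sequences."""
    pending = [(c, op) for c in range(copies) for op in DEPS]
    ready_at = {}                      # (copy, op) -> cycle its result is available
    cycle = 0
    while pending:
        issued = []
        for c, op in pending:
            if len(issued) == UNITS:
                break
            if all(ready_at.get((c, d), float("inf")) <= cycle for d in DEPS[op]):
                issued.append((c, op))
        for item in issued:
            ready_at[item] = cycle + LATENCY
            pending.remove(item)
        cycle += 1
    return max(ready_at.values())

for n in (1, 2, 4):
    print(n, makespan(n))
```

Because a single sequence leaves most issue slots empty while it waits for results, additional independent sequences fill those slots, and the total time grows far more slowly than the number of divisions.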

The FMA units on this architecture are also used to compute integer multiplications: the integers are transferred to floating-point registers with setf, multiplied with the xma instruction, and the result is returned with getf.
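The round trip through the floating-point registers can be sketched in Python as follows. The functions are simplified models written for this illustration: a floating-point register is represented only by its 64-bit significand, and just the unsigned forms of xma (lower and upper half of the 128-bit result) are covered.

```python
MASK64 = (1 << 64) - 1

def setf_sig(r):
    """setf.sig: place a 64-bit integer into the significand of an FP register."""
    return r & MASK64

def getf_sig(f):
    """getf.sig: read the 64-bit significand back into a general register."""
    return f & MASK64

def xma_l(f3, f4, f2):
    """xma.l: lower 64 bits of f3*f4 + f2."""
    return (f3 * f4 + f2) & MASK64

def xma_hu(f3, f4, f2):
    """xma.hu: upper 64 bits of the unsigned 128-bit value f3*f4 + f2."""
    return ((f3 * f4 + f2) >> 64) & MASK64

def mul64(a, b):
    """Unsigned 64x64 -> 128-bit multiply, returned as (high, low) halves."""
    fa, fb = setf_sig(a), setf_sig(b)          # general -> FP registers
    hi = getf_sig(xma_hu(fa, fb, 0))           # FP -> general register
    lo = getf_sig(xma_l(fa, fb, 0))
    return hi, lo

hi, lo = mul64(2**63, 4)
print(hi, lo)
```

A plain 64-bit product needs only the xma.l result; the upper half is computed by a second xma when the full 128-bit product is required.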

In document Optimal Global Instruction Scheduling for the Itanium® Processor Architecture (Page 38-44)