The scope of this appendix is restricted to software optimisation on Intel platforms because these appear to be the most common platforms nowadays. Currently, the most popular Intel processors are those of families descending from the Core architecture
which introduced many revolutionary features to improve the processing performance, these features include:
• Wide Dynamic Execution. With Core Microarchitecture, each core can execute up to 4 instructions simultaneously. Intel also introduced Macro-fusion and Micro- fusion, which fuse micro-ops.
• Micro-fusion operates on low-level processor instructions called micro-operations3
(micro-ops) to fuse multiple micro-ops of the same instruction into a single (complex) micro-op.
• Macro-fusion operates on instructions and combines common pairs into a sin- gle micro-op. In Intel Core Microarchitecture macro-fusion is only supported in 32-bit mode but with the introduction of Intel Microarchitecture (Ne- halem), macro-fusion is now supported in the 64-bit mode too. Currently, the first instruction of the marco-fused pair is restricted to CMP (compare) or TEST (test) instructions followed by a conditional jump instruction that checks the CF and/or ZF flags in the EFLAG register; see sectionA.3.2. In Intel Core Microarchitecture, supported conditional jump instructions are: JA/JNBE, JE/JZ, JNA/JBE, JNE/JNZ, JAE/JNB/JC and JNAE/JB/JC — condition codes supporting unsigned values. Intel Microarchitecture (Ne- halem) adds support to: JL/JNGE, JGE/JNL, JLE/JNG and JG/JNLE — condition codes supporting signed values [82]; see table A.1 for description of condition codes.
• Advanced Smart Cache. This feature allows each core to dynamically utilise up to 100% of the (fast) L2 cache if available, which was previously not possible resulting in an inefficient use of the cache.
• Advanced Digital Media Boost. This feature improved the execution of SIMD instructions by allowing the whole 128-bit instruction to be executed in one clock cycle, which used to utilise several clock cycles previously.
A.3.1 SIMD Instruction Set
In SIMD (Single Instruction, Multiple Data), a single instruction operates on multiple data simultaneously achieving a data level parallelism. SMID instructions are especially useful when the same operation needs to be executed repeatedly on different data (e.g.
3
Note that micro-ops are executed directly by the hardware and should not be confused with normal instructions which are themselves composed of several micro-ops
hashing a message). In 1997, Intel introduced the MMX instruction set based on SIMD, which was later improved by introducing the SSE (Streaming SIMD Extensions). Later versions of SSE include: SSE2, SSE3, SSSE3, SSE4, SSE5 and recently AVX. Most of these instructions were introduced to support heavy processing applications like multimedia, gaming, signal processing and modelling. Today, SIMD instructions are available in most of the recent platforms from Intel and AMD to ARM.
A.3.2 Registers in Intel Platforms
Registers are very fast storage mediums located near the processors. Below we provide brief descriptions of some register types found in most Intel platforms:
• General-purpose Registers: these are eight 32-bit registers (EAX, EBX, ECX, EDX, EBP, ESI, EDI, ESP) hold operands and memory pointers. In 64-bit mode, there are sixteen 64-bit registers; these are EAX, EBX, ECX, EDX, EBP, ESI, EDI, ESP, R8D-R15D for 32-bit operands, and RAX, RBX, RCX, RDX, RBP, RSI, RDI, RSP, R8-R15 for 64-bit operands.
• Segment Registers: these are six 16-bit registers (CS, DS, SS, ES, FS, GS) used to hold pointers.
• MMX Registers: these are eight 64-bit registers (MM0 – MM7) introduced with the MMX technology to perform operations on 64-bit data.
• XMM Registers: these are either eight (in 32-bit mode) or sixteen (in 64-bit mode) 128-bit registers (XMM0 – XMM15), introduced to handle the SSE 128-bit data types.
• MXCSR Register. This is a 32-bit register used to control some of the SSEx floating-point operations.
• EIP Register. This is a 32-bit register that holds the instruction pointer which points to the next instruction to be executed.
• EFLAG Register : this is a single 32-bit (in 32-bit mode) register used to reflect the results of comparison, arithmetic and other instructions by setting its flags appropriately. In 64-bit mode, EFLAG is called RFLAG and is 64-bit, the lower 32-bit are identical to EFLGA, and the upper 32-bit are reserved.
The EFLAG register contains 1 control flag, 6 status flags, 11 system flags and the rest are reserved bits. Below we elaborate on EFLAG’s 6 status flags, which are relevant to the discussions in the following sections.
Code Description Flag
O Overflow OF=1
NO No Overflow OF=1
B/NAE Below/Not Above or Equal CF = 1 NB/AE Not Below/Above or Equal CF = 0
E/Z Equal/Zero ZF = 1
NE/NZ Not Equal/Not Zero ZF = 0
BE/NA Below or Equation/Not Above CF ∨ ZF = 1 NBE/A Not Below or Equal Above CF ∨ ZF = 0
S Sign SF = 1
NS No Sign SF = 0
P/PE Parity/Parity Even PF =1
NP/PO Not Parity/Parity Odd PF = 0
L/NGE Less/Not Greater than or Equal SP ⊕ OF = 1 NL/GE Not Less than/Greater than or Equal SF ⊕ OF = 0
LE/NG Less than or Equal/Not Greater than ((SF ⊕ OF) ∨ ZF) = 1 NLE/G Not Less than or Equal/Greater than ((SF ⊕ OF) ∨ ZF) = 0
Table A.1: Intel condition codes
• Carry Flag CF: set if a carry or borrow of an arithmetic operation is generated. • Parity Flag PF: set if the number of 1’s in the least significant byte of the result
from the previous operation is even (indicating even parity), otherwise it is unset (indicating odd parity).
• Adjust Flag AF: this flag is primarily used in BCD (binary-coded decimal) arith- metic and is set only if the previous operation generates a carry or borrow.
• Zero Flag ZF: set if the result of the previous operation is zero.
• Sign Flag SF: used with signed values and is set to the same value of the sign bit (most significant bit), which is 0 if positive or 1 if negative.
• Overflow Flag OF: set if the result of an operation doesnt fit the specified desti- nation operand (e.g. storing a 64-bit value in a 32-bit register).
These flags can be tested by other instructions by suffixing a “condition code” to the instruction. Some condition codes are described in table A.1[83].
As indicated in table A.1, some condition code mnemonics are synonym to oth- ers, like B (Below) and NAE (Not Above or Equal). Intel adopts this convention for situations where using one mnemonic is more intelligible than the other. Also, When dealing with unsigned values, the below/above mnemonics are used; but when dealing with signed values, the greater than/less than mnemonics are used instead.