ARM/Thumb Unified Assembly Language Instructions
5.1 Instruction set basics
Table 5-4 Multiplication operations in assembly language

Opcode   Operands               Description                                      Function
Multiplies
MLA      Rd, Rn, Rm, Ra         Multiply accumulate (MAC)                        Rd = Ra + (Rn × Rm)
MLS      Rd, Rn, Rm, Ra         Multiply and Subtract                            Rd = Ra - (Rm × Rn)
MUL      Rd, Rn, Rm             Multiply                                         Rd = Rn × Rm
SMLAL    RdLo, RdHi, Rn, Rm     Signed 32-bit multiply with a 64-bit accumulate  RdHiLo += Rn × Rm
SMULL    RdLo, RdHi, Rn, Rm     Signed 64-bit multiply                           RdHiLo = Rn × Rm
UMLAL    RdLo, RdHi, Rn, Rm     Unsigned 64-bit MAC                              RdHiLo += Rn × Rm
UMULL    RdLo, RdHi, Rn, Rm     Unsigned 64-bit multiply                         RdHiLo = Rn × Rm

5.2.4 Integer SIMD instructions
Single Instruction, Multiple Data (SIMD) instructions were first added in the ARMv6 architecture. They provide the ability to pack, extract and unpack 8-bit and 16-bit quantities within 32-bit registers and to perform multiple arithmetic operations, such as add, subtract, compare or multiply, on such packed data with a single instruction. These must not be confused with the significantly more powerful Advanced SIMD (NEON) operations that were introduced in the ARMv7 architecture and are covered in detail in Chapter 7 and the ARM® NEON™ Programmer’s Guide.
Integer register SIMD instructions
The ARMv6 SIMD operations make use of the GE (greater than or equal) flags within the CPSR. These are distinct from the normal condition flags. There is one flag for each of the four byte positions within a word. Normal data processing operations produce one result and set the N, Z, C and V flags (as seen in Figure 3-6 on page 3-8). The SIMD operations produce up to four outputs and set only the GE flags, to indicate overflow. The MSR and MRS instructions can be used to write or read these flags directly.
The general form of the SIMD instructions is that subword quantities in each register are operated on in parallel (for example, four byte-sized ADDs can be performed) and the GE flags are set or cleared according to the results of the instruction. Different types of add and subtract can be specified using appropriate prefixes. For example, QADD16 performs saturating addition on halfwords within a register. SADD8/UADD8 and SSUB8/USUB8 set the GE bits individually, while SADD16/UADD16 and SSUB16/USUB16 set GE bits [3:2] together based on the top halfword result, and GE bits [1:0] together based on the bottom halfword result.
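The per-byte behavior described above can be sketched as a portable C model. This is only an illustrative sketch, not ARM's definition: the names simd_result and uadd8_model are hypothetical, and the GE rule used here (GE[i] is set when byte lane i produces a carry out, that is, when the 9-bit sum is 0x100 or more) is assumed from the architecture's description of UADD8.

```c
#include <stdint.h>

/* Hypothetical container: the packed 32-bit result plus the four GE flags.
   Bit i of ge models GE[i], which belongs to byte lane i. */
typedef struct {
    uint32_t value;
    unsigned ge;
} simd_result;

/* Hypothetical reference model of UADD8 Rd, Rn, Rm:
   four independent unsigned byte additions in one operation. */
static simd_result uadd8_model(uint32_t rn, uint32_t rm)
{
    simd_result r = { 0u, 0u };

    for (int lane = 0; lane < 4; lane++) {
        uint32_t a = (rn >> (8 * lane)) & 0xFFu;
        uint32_t b = (rm >> (8 * lane)) & 0xFFu;
        uint32_t sum = a + b;                     /* up to 9 bits wide */

        r.value |= (sum & 0xFFu) << (8 * lane);   /* keep low 8 bits in the lane */
        if (sum >= 0x100u) {
            r.ge |= 1u << lane;                   /* assumed: GE[i] = carry out of lane i */
        }
    }
    return r;
}
```

For example, adding 0x01FF7F80 and 0x01018080 in this model gives 0x0200FF00, with GE[0] and GE[2] set because only those two byte lanes carry out.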
Also available are the ASX and SAX classes of instructions, which exchange the halfwords of one operand and then perform an add and a subtract (ASX) or a subtract and an add (SAX) on the resulting halfword pairs in parallel. Like the add and subtract instructions described previously, these exist in unsigned (UASX/USAX), signed (SASX/SSAX) and saturating (QASX/QSAX) versions.
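As a rough illustration of the exchange step, the following C sketch models the result of SASX only (the GE flags are left out). The helper names sext16 and sasx_model are hypothetical, and the lane assignment (top halfword receives the sum, bottom halfword receives the difference) is assumed from the architectural description of SASX.

```c
#include <stdint.h>

/* Sign-extend the low 16 bits of x to a 32-bit signed value. */
static int32_t sext16(uint32_t x)
{
    x &= 0xFFFFu;
    return (int32_t)(x ^ 0x8000u) - 0x8000;
}

/* Hypothetical model of SASX Rd, Rn, Rm (result only):
   top lane    = Rn[31:16] + Rm[15:0]
   bottom lane = Rn[15:0]  - Rm[31:16]
   The halfwords of Rm are effectively exchanged before the add/subtract. */
static uint32_t sasx_model(uint32_t rn, uint32_t rm)
{
    int32_t n_lo = sext16(rn);
    int32_t n_hi = sext16(rn >> 16);
    int32_t m_lo = sext16(rm);
    int32_t m_hi = sext16(rm >> 16);

    int32_t sum  = n_hi + m_lo;
    int32_t diff = n_lo - m_hi;

    /* Each lane keeps only its low 16 bits. */
    return (((uint32_t)sum & 0xFFFFu) << 16) | ((uint32_t)diff & 0xFFFFu);
}
```

With Rn holding 5 (top) and 3 (bottom) and Rm holding 1 (top) and 2 (bottom), the model returns 7 in the top halfword and 2 in the bottom halfword.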
Figure 5-1 SIMD ADD v6
Figure 5-1 shows how the SADD16 instruction performs two separate additions with a single instruction. The top halfwords of registers R3 and R0 are added, with the result going into the top halfword of register R1, and the bottom halfwords of registers R3 and R0 are added, with the result going into the bottom halfword of register R1. The GE[3:2] bits in the CPSR are set based on the top halfword result and the GE[1:0] bits based on the bottom halfword result. In each case the overflow information is duplicated in the specified pair of bits.
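The SADD16 R1, R3, R0 operation described above can be approximated with the C sketch below. The names sadd16_result, sext16_lane and sadd16_model are hypothetical. The GE rule is an assumption taken from the architectural pseudocode for the signed variants, where each GE pair is set when the full-precision signed sum of its lane is greater than or equal to zero.

```c
#include <stdint.h>

/* Hypothetical container: packed result plus GE[3:0] in the low four bits of ge. */
typedef struct {
    uint32_t value;
    unsigned ge;
} sadd16_result;

/* Sign-extend the low 16 bits of x to a 32-bit signed value. */
static int32_t sext16_lane(uint32_t x)
{
    x &= 0xFFFFu;
    return (int32_t)(x ^ 0x8000u) - 0x8000;
}

/* Hypothetical model of SADD16 Rd, Rn, Rm. */
static sadd16_result sadd16_model(uint32_t rn, uint32_t rm)
{
    int32_t lo = sext16_lane(rn) + sext16_lane(rm);             /* bottom halfwords */
    int32_t hi = sext16_lane(rn >> 16) + sext16_lane(rm >> 16); /* top halfwords */
    sadd16_result r;

    r.value = (((uint32_t)hi & 0xFFFFu) << 16) | ((uint32_t)lo & 0xFFFFu);
    r.ge = 0u;
    if (lo >= 0) {
        r.ge |= 0x3u;   /* assumed rule: GE[1:0] set together from the bottom lane */
    }
    if (hi >= 0) {
        r.ge |= 0xCu;   /* assumed rule: GE[3:2] set together from the top lane */
    }
    return r;
}
```

Calling sadd16_model(0x0001FFFF, 0x00020000) models adding the halfword pairs (1, -1) and (2, 0): the result is 0x0003FFFF and only GE[3:2] are set, because the bottom lane sum is negative.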
Integer register SIMD multiplies
Like the other SIMD operations, these operate in parallel on subword quantities within registers. The instruction can also include an accumulate option, with either an add or a subtract specified. The instructions are SMUAD (SIMD multiply and add with no accumulate), SMUSD (SIMD multiply and subtract with no accumulate), SMLAD (multiply and add with accumulate) and SMLSD (multiply and subtract with accumulate).
Adding an L (long) before D indicates 64-bit accumulation.
Using the X (eXchange) suffix indicates halfwords in Rm are swapped before calculation. The Q flag is set if accumulation overflows.
The SMUSD instruction shown in Figure 5-2 performs two signed 16-bit multiplies (top × top and bottom × bottom) and then subtracts the two results. This kind of operation is useful when performing operations on complex numbers with a real and imaginary component, a common task for filter algorithms.
Figure 5-2 v6 SIMD signed dual multiply subtract
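To make the complex-number use case concrete, here is a minimal C sketch of SMUSD. The names sext16_half and smusd_model are hypothetical. The operand order of the subtraction (bottom product minus top product) is assumed from the architectural pseudocode, and the packing convention (real part in the bottom halfword, imaginary part in the top halfword) is chosen purely for illustration.

```c
#include <stdint.h>

/* Sign-extend the low 16 bits of x to a 32-bit signed value. */
static int32_t sext16_half(uint32_t x)
{
    x &= 0xFFFFu;
    return (int32_t)(x ^ 0x8000u) - 0x8000;
}

/* Hypothetical model of SMUSD Rd, Rn, Rm:
   Rd = (Rn[15:0] * Rm[15:0]) - (Rn[31:16] * Rm[31:16])
   The difference of two 16 x 16 products always fits in 32 bits. */
static int32_t smusd_model(uint32_t rn, uint32_t rm)
{
    int32_t bottom = sext16_half(rn) * sext16_half(rm);
    int32_t top    = sext16_half(rn >> 16) * sext16_half(rm >> 16);
    return bottom - top;
}
```

With a = 3 + 4i packed as 0x00040003 and b = 1 + 2i packed as 0x00020001, smusd_model(a, b) returns 3 × 1 - 4 × 2 = -5, which is the real part of the complex product a × b.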
Sum of absolute differences
Calculating the sum of absolute differences is a key operation in the motion vector estimation component of common video codecs and is carried out over arrays of pixel data. The USADA8 Rd, Rn, Rm, Ra instruction is illustrated in Figure 5-3 on page 5-11. It calculates the sum of absolute differences of the bytes within a word in registers Rn and Rm, adds in the value stored in Ra and places the result in Rd.
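The calculation performed by USADA8 can be sketched in portable C as follows. The function name usada8_model is hypothetical; the code is a reference model of the description above, not a statement of how any core implements the instruction.

```c
#include <stdint.h>

/* Hypothetical model of USADA8 Rd, Rn, Rm, Ra:
   Rd = Ra + sum over the four byte lanes of |Rn.byte[i] - Rm.byte[i]|,
   with the bytes treated as unsigned values. */
static uint32_t usada8_model(uint32_t rn, uint32_t rm, uint32_t ra)
{
    uint32_t acc = ra;

    for (int lane = 0; lane < 4; lane++) {
        uint32_t a = (rn >> (8 * lane)) & 0xFFu;
        uint32_t b = (rm >> (8 * lane)) & 0xFFu;
        acc += (a > b) ? (a - b) : (b - a);   /* absolute difference of one lane */
    }
    return acc;
}
```

For instance, with four pixels packed as 0x10203040 and 0x40302010, the absolute differences are 0x30, 0x10, 0x10 and 0x30, so usada8_model(0x10203040, 0x40302010, 100) returns 228. Calling the model repeatedly with the previous result as Ra accumulates the sum over a whole block of pixels.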
Figure 5-3 Sum of absolute differences
Data packing and unpacking
Packed data is common in many video and audio codecs (video data is usually expressed as packed arrays of 8-bit pixel data, and audio data might use packed 16-bit samples) and also in network protocols. Before additional instructions were added in the ARMv6 architecture, this data had to be either loaded with LDRH and LDRB instructions or loaded as words and then unpacked using Shift and Bit Clear operations; both are relatively inefficient. Pack (PKHBT, PKHTB) instructions permit 16-bit or 8-bit values to be extracted from any position in a register and packed into another register. Unpack instructions (UXTH, UXTB, plus many variants, including signed, with addition) can extract 8-bit or 16-bit values from any bit position within a register. This enables sequences of packed data in memory to be loaded efficiently using word or doubleword loads, unpacked into separate register values, operated on and then packed back into registers for efficient writing out to memory.
Figure 5-4 Packing and unpacking of 16-bit data in 32-bit registers
In the example shown in Figure 5-4, R0 contains two separate 16-bit values, denoted A and B. You can use the UXTH instruction to unpack the two halfwords into registers for future processing, and the PKHBT instruction to pack them back into a single register.
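Instruction sequence and register contents from Figure 5-4 (contents shown as top halfword : bottom halfword):

    Initial contents               R0 = A : B
    UXTH  r1, r0, ROR #16          R1 = 00...00 : A
    UXTH  r2, r0                   R2 = 00...00 : B
    PKHBT r0, r2, r1, LSL #16      R0 = A : B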
It would be possible to replace the unpack instruction in each case with a MOV and either LSL or LSR instructions, but in this case a single instruction intended to work on parts of registers is used.
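The unpack and pack steps can be modeled in C as a rough sketch. The names ror32, uxth_model and pkhbt_model are hypothetical helpers written for this illustration; they model only the register results of UXTH with an optional rotation and of PKHBT with an LSL shift.

```c
#include <stdint.h>

/* Rotate a 32-bit value right by n bits (n is taken modulo 32). */
static uint32_t ror32(uint32_t x, unsigned n)
{
    n &= 31u;
    return n ? ((x >> n) | (x << (32u - n))) : x;
}

/* Hypothetical model of UXTH Rd, Rm, ROR #rotation:
   rotate Rm, then zero-extend the bottom halfword. */
static uint32_t uxth_model(uint32_t rm, unsigned rotation)
{
    return ror32(rm, rotation) & 0xFFFFu;
}

/* Hypothetical model of PKHBT Rd, Rn, Rm, LSL #shift (shift in 0..31):
   bottom halfword from Rn, top halfword from the shifted Rm. */
static uint32_t pkhbt_model(uint32_t rn, uint32_t rm, unsigned shift)
{
    return (rn & 0xFFFFu) | ((rm << shift) & 0xFFFF0000u);
}
```

Starting from r0 = 0xAAAABBBB, uxth_model(r0, 16) gives 0x0000AAAA, uxth_model(r0, 0) gives 0x0000BBBB, and pkhbt_model(0x0000BBBB, 0x0000AAAA, 16) rebuilds 0xAAAABBBB. The shift-based alternative mentioned above corresponds to r0 >> 16 for the top halfword and (r0 << 16) >> 16 for the bottom halfword.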
Byte selection
The SEL instruction enables you to select each byte of the result from the corresponding byte in either the first or the second operand, based on the value of the GE[3:0] bits in the CPSR. The packed data arithmetic operations set these bits as a result of add or subtract operations, and SEL can be used after these to extract parts of the data – for example, to find the smaller of the two bytes in each position.
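The byte-wise minimum mentioned above can be sketched in C by pairing a model of USUB8 with a model of SEL. All three function names are hypothetical. The sketch assumes that USUB8 sets GE[i] when byte i of the first operand is greater than or equal to byte i of the second (no borrow), and that SEL takes byte i from its first operand when GE[i] is set and from its second operand otherwise.

```c
#include <stdint.h>

/* Hypothetical model of the GE flags produced by USUB8 Rd, Rn, Rm:
   bit i is set when Rn.byte[i] >= Rm.byte[i] (assumed rule: no borrow). */
static unsigned usub8_ge_model(uint32_t rn, uint32_t rm)
{
    unsigned ge = 0u;
    for (int lane = 0; lane < 4; lane++) {
        uint32_t a = (rn >> (8 * lane)) & 0xFFu;
        uint32_t b = (rm >> (8 * lane)) & 0xFFu;
        if (a >= b) {
            ge |= 1u << lane;
        }
    }
    return ge;
}

/* Hypothetical model of SEL Rd, Rn, Rm:
   byte i comes from Rn when GE[i] is set, otherwise from Rm. */
static uint32_t sel_model(uint32_t rn, uint32_t rm, unsigned ge)
{
    uint32_t result = 0u;
    for (int lane = 0; lane < 4; lane++) {
        uint32_t mask = 0xFFu << (8 * lane);
        result |= ((ge >> lane) & 1u) ? (rn & mask) : (rm & mask);
    }
    return result;
}

/* Unsigned minimum of each byte pair, modeling USUB8 followed by SEL. */
static uint32_t min_u8x4_model(uint32_t a, uint32_t b)
{
    unsigned ge = usub8_ge_model(a, b);   /* GE[i] set where a.byte[i] >= b.byte[i] */
    return sel_model(b, a, ge);           /* pick b where a >= b, otherwise a */
}
```

For a = 0x10FF2000 and b = 0x20013000, the modeled GE flags are 0101 in binary and the selected result is 0x10012000, the smaller byte in each position.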
5.3 Memory instructions
ARM cores perform Arithmetic Logic Unit (ALU) operations only on registers. The only supported memory operations are the load (which reads data from memory into registers) and the store (which writes data from registers to memory). LDR and STR instructions can be conditionally executed, in the same fashion as other instructions.
You can specify the size of the Load or Store transfer by appending a B for Byte, H for Halfword, or D for doubleword (64 bits) to the instruction, for example, LDRB. For loads only, an extra S can be used to indicate a signed byte or halfword (SB for Signed Byte or SH for Signed Halfword). See LDR on page A-18 for examples of this. This approach can be useful, because if you load an 8-bit or 16-bit quantity into a 32-bit register you must decide what to do with the most significant bits of the register. Unsigned numbers are zero-extended (that is, the most significant 24 or 16 bits of the register are set to zero), but for a signed number, it is necessary to copy the sign bit (bit [7] for a byte, or bit [15] for a halfword) into the top 24 (or 16) bits of the register.
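The difference between zero extension and sign extension can be illustrated with the C sketch below. The function names are hypothetical; each one models only the register value that results once a byte or halfword has been read from memory, as LDRB, LDRSB, LDRH and LDRSH would produce it.

```c
#include <stdint.h>

/* Models the register value after LDRB: top 24 bits are zero. */
static uint32_t zero_extend8(uint8_t loaded)
{
    return (uint32_t)loaded;
}

/* Models the register value after LDRSB: bit [7] is copied into the top 24 bits. */
static uint32_t sign_extend8(uint8_t loaded)
{
    uint32_t v = loaded;
    return (v & 0x80u) ? (v | 0xFFFFFF00u) : v;
}

/* Models the register value after LDRH: top 16 bits are zero. */
static uint32_t zero_extend16(uint16_t loaded)
{
    return (uint32_t)loaded;
}

/* Models the register value after LDRSH: bit [15] is copied into the top 16 bits. */
static uint32_t sign_extend16(uint16_t loaded)
{
    uint32_t v = loaded;
    return (v & 0x8000u) ? (v | 0xFFFF0000u) : v;
}
```

Loading the byte 0x80 therefore yields 0x00000080 when treated as unsigned and 0xFFFFFF80 (that is, -128) when treated as signed, while a byte such as 0x7F gives 0x0000007F in both cases.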