Optimizing C/C++ Code
SSHVL SSHVR
Shifts src2 to the left/right src1 bits. Saturates the result if the shifted value is greater than MAX_INT or less than MIN_INT.
int _sub4 (int src1, int src2); SUB4 Performs 2s-complement subtraction between pairs of packed 8-bit values
int _subabs4 (int src1, int src2); SUBABS4 Calculates the absolute value of the differences for each pair of packed 8-bit values
uint _swap4 (uint src); SWAP4 Exchanges pairs of bytes (an endian swap) within each 16-bit value
uint _unpkhu4 (uint src); UNPKHU4 Unpacks the two high unsigned 8-bit values into unsigned packed 16-bit values
uint _unpklu4 (uint src); UNPKLU4 Unpacks the two low unsigned 8-bit values into unsigned packed 16-bit values
uint _xpnd2 (uint src); XPND2 Bits 1 and 0 of src are replicated to the upper and lower halfwords of the result, respectively.
uint _xpnd4 (uint src); XPND4 Bits 3 through 0 of src are replicated to bytes 3 through 0 of the result.
†See section 2.4.2, Wider Memory Access for Smaller Data Widths, on page 2-28 for more information.
‡See the TMS320C6000 Optimizing Compiler User’s Guide for details on manipulating 8-byte data quantities.
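The packed 8-bit rows above can be checked on a host machine with small portable models. The following is a minimal sketch, not TI code: the model_* names are hypothetical, the operands are carried as uint32_t bit patterns rather than the int/uint types in the prototypes, the _subabs4 model assumes the packed bytes are treated as unsigned, and the _xpnd4 model assumes that each of bits 3 through 0 fills one byte of the result.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical host-side models; these are NOT the TI intrinsics. */

/* SUB4 model: subtract each byte independently (mod 256), so no
   borrow crosses a byte boundary. */
uint32_t model_sub4(uint32_t a, uint32_t b)
{
    uint32_t r = 0;
    for (int i = 0; i < 32; i += 8) {
        uint32_t d = ((a >> i) & 0xFFu) - ((b >> i) & 0xFFu);
        r |= (d & 0xFFu) << i;
    }
    return r;
}

/* SUBABS4 model: per-byte absolute difference.
   Assumption: the packed bytes are unsigned. */
uint32_t model_subabs4(uint32_t a, uint32_t b)
{
    uint32_t r = 0;
    for (int i = 0; i < 32; i += 8) {
        uint32_t x = (a >> i) & 0xFFu;
        uint32_t y = (b >> i) & 0xFFu;
        r |= (x > y ? x - y : y - x) << i;
    }
    return r;
}

/* SWAP4 model: exchange the two bytes inside each 16-bit half. */
uint32_t model_swap4(uint32_t s)
{
    return ((s & 0x00FF00FFu) << 8) | ((s >> 8) & 0x00FF00FFu);
}

/* UNPKHU4 model: zero-extend bytes 3 and 2 into two 16-bit fields. */
uint32_t model_unpkhu4(uint32_t s)
{
    return ((s >> 8) & 0x00FF0000u) | ((s >> 16) & 0xFFu);
}

/* UNPKLU4 model: zero-extend bytes 1 and 0 into two 16-bit fields. */
uint32_t model_unpklu4(uint32_t s)
{
    return ((s << 8) & 0x00FF0000u) | (s & 0xFFu);
}

/* XPND4 model: bit i (i = 0..3) of s becomes 0xFF or 0x00 in byte i. */
uint32_t model_xpnd4(uint32_t s)
{
    uint32_t r = 0;
    for (int i = 0; i < 4; i++) {
        if ((s >> i) & 1u)
            r |= 0xFFu << (8 * i);
    }
    return r;
}

/* Usage example: spot checks with hand-computed values. */
void packed8_models_demo(void)
{
    assert(model_sub4(0x10203040u, 0x01023050u) == 0x0F1E00F0u);
    assert(model_swap4(0x11223344u) == 0x22114433u);
    assert(model_unpkhu4(0x11223344u) == 0x00110022u);
    assert(model_xpnd4(0x5u) == 0x00FF00FFu);
}
```

Models like these are only useful for comparing against results produced on the target. In C6000 code the intrinsic itself should be called so that the compiler can emit the single corresponding instruction.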
The intrinsics listed in Table 2−8 are included only for C64x+ devices. The intrinsics shown correspond to the indicated C6000 assembly language instruction(s). See the TMS320C6000 CPU and Instruction Set Reference Guide for more information.
See Table 2−6 on page 2-15 for the listing of generic C6000 intrinsics. See Table 2−7 on page 2-19 for the listing of C64x/C64x+-specific intrinsics. See Table 2−9 on page 2-27 for the listing of C67x-specific intrinsics.
Table 2−8. TMS320C64x+ C/C++ Compiler Intrinsics
C/C++ Compiler Intrinsic Assembly Instruction Description
long long _addsub(uint src1, uint src2); ADDSUB Calculates the addition and subtraction on common inputs in parallel.
long long _addsub2(uint src1, uint src2); ADDSUB2 Calculates the 16-bit addition and subtraction on common inputs in parallel.
long long _cmpy(uint src1, uint src2); CMPY Calculates the complex multiply for the pair of 16-bit complex values.
uint _cmpyr(uint src1, uint src2); CMPYR Calculates the complex multiply for the pair of 16-bit complex values with rounding.
uint _cmpyr1(uint src1, uint src2); CMPYR1 Calculates the complex multiply for the pair of 16-bit complex values with rounding.
long long _ddotp4(uint src1, uint src2); DDOTP4 The product of the lower byte of the lower half-word of src2 and the lower half-word of src1 is added to the product of the upper byte of the lower half-word of src2 and the upper half-word of src1. The result is placed in the lower destination register.
The product of the lower byte of the upper half-word of src2 and the lower half-word of src1 is added to the product of the upper byte of the upper half-word of src2 and the upper half-word of src1. The result is placed in the upper destination register.
long long _ddotph2(long long src1_o:src1_e, uint src2); DDOTPH2 The product of the lower half-words of src1_o and src2 is added to the product of the upper half-words of src1_o and src2. The result is placed in the upper destination register.
The product of the lower half-word of src1_o and the upper half-word of src2 is added to the product of the upper half-word of src1_e and the lower half-word of src2. The result is placed in the lower destination register.
uint _ddotph2r(long long src1_o:src1_e, uint src2); DDOTPH2R The product of the lower half-words of src1_o and src2 is added to the product of the upper half-words of src1_o and src2. The result is rounded and placed in the upper destination register.
The product of the lower half-word of src1_o and the upper half-word of src2 is added to the product of the upper half-word of src1_e and the lower half-word of src2. The result is rounded and placed in the lower destination register.
long long _ddotpl2(long long src1_o:src1_e, uint src2); DDOTPL2 The product of the lower half-words of src1_e and src2 is added to the product of the upper half-words of src1_e and src2. The result is placed in the lower destination register.
The product of the lower half-word of src1_e and the upper half-word of src2 is added to the product of the upper half-word of src1_o and the lower half-word of src2. The result is placed in the upper destination register.
uint _ddotpl2r(long long src1_o:src1_e, uint src2); DDOTPL2R The product of the lower half-words of src1_e and src2 is added to the product of the upper half-words of src1_e and src2. The result is rounded and placed in the lower destination register.
The product of the lower half-word of src1_e and the upper half-word of src2 is added to the product of the upper half-word of src1_o and the lower half-word of src2. The result is rounded and placed in the upper destination register.
long long _dmv(uint src1, uint src2); DMV The two independent registers are moved to a register pair.
int _dotpnrsu2(int src1, uint src2); DOTPNRSU2 The product of the lower unsigned 16-bit values in src1 and src2 is subtracted from the product of the signed upper 16-bit values of src1 and src2. 2^15 is added and the result is sign shifted right by 16. The intermediate results are maintained to 33-bit precision.
int _dotprsu2(int src1, uint src2); DOTPRSU2 The product of the first signed pair of 16-bit values is added to the product of the unsigned second pair of 16-bit values. 2^15 is added and the result is sign shifted right by 16. The intermediate results are maintained to 33-bit precision.
long long _dpack2(uint src1, uint src2); DPACK2 Performs PACK2 and PACKH2 operations in parallel on common inputs.
long long _dpackx2(uint src1, uint src2); DPACKX2 Performs two PACKLH2 operations in parallel on common inputs.
uint _gmpy(uint src1, uint src2); GMPY Performs Galois Field Multiply.
long long _mpy2ir(uint src1, uint src2); MPY2IR Performs two 16 by 32 multiplies. The product of the upper half-word of src1 and src2 is rounded, shifted and then placed in the upper destination register. The product of the lower half-word of src1 and src2 is rounded, shifted and then placed in the lower destination register.
int _mpy32(int src1, int src2); MPY32 Produces a 32 by 32 multiply with a 32-bit result.
long long _mpy32ll(int src1, int src2); MPY32
long long _mpy32su(int src1, uint src2); MPY32SU
long long _mpy32u(uint src1, uint src2); MPY32U
long long _mpy32us(uint src1, int src2); MPY32US
Produces a 32 by 32 multiply with a 64-bit result. The inputs and outputs can be signed or unsigned.
uint _rpack2 (uint src1, uint src2); RPACK2 The src1 and src2 inputs are shifted left by 1 with saturation. The upper half-words of the shifted inputs are placed in the return value.
long long _saddsub(uint src1, uint src2); SADDSUB Calculates the addition and subtraction with saturation on common inputs in parallel.
long long _saddsub2(uint src1, uint src2); SADDSUB2 Calculates the 16-bit addition and subtraction with saturation on common inputs in parallel.
long long _shfl3 (uint src1, uint src2); SHFL3 Performs a 3-way bit interleave of three 16-bit values to produce a 48-bit result.
int _smpy32(int src1, int src2); SMPY32 Produces a 32 by 32 multiply with a 32-bit result by shifting the intermediate 64-bit result left by 1 with saturation and then placing the upper 32 bits of the shifted result in the destination register.
double _smpy2 (int src1, int src2); SMPY2 Performs 16-bit multiplication between pairs of signed packed 16-bit values, with an additional 1 bit left-shift and saturate into a double result.
uint _ssub2 (uint src1, uint src2); SSUB2 Performs 16-bit subtraction with saturation.
uint _xormpy (uint src1, uint src2); XORMPY Performs Galois field multiply with a zero-value polynomial.
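The 32 by 32 multiply rows of Table 2−8 differ only in how the operands are widened and in what part of the 64-bit product is kept. The sketch below models that arithmetic in portable C for checking results on a host. It is not TI code: the model_* names are hypothetical, int64_t/uint64_t stand in for the long long return type, and the _smpy32 model follows the table wording (double the 64-bit product with saturation, keep the upper 32 bits).

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical host-side models; these are NOT the TI intrinsics. */

/* 64-bit-result MPY32 family: widen each operand according to the
   signedness in the intrinsic's prototype, then multiply in 64 bits. */
int64_t model_mpy32ll(int32_t a, int32_t b)
{
    return (int64_t)a * (int64_t)b;   /* signed x signed */
}

int64_t model_mpy32su(int32_t a, uint32_t b)
{
    return (int64_t)a * (int64_t)b;   /* signed x unsigned */
}

uint64_t model_mpy32u(uint32_t a, uint32_t b)
{
    return (uint64_t)a * (uint64_t)b; /* unsigned x unsigned */
}

int64_t model_mpy32us(uint32_t a, int32_t b)
{
    return (int64_t)a * (int64_t)b;   /* unsigned x signed */
}

/* SMPY32 model: upper 32 bits of the saturated value (a * b) << 1. */
int32_t model_smpy32(int32_t a, int32_t b)
{
    /* Doubling the product overflows 64 bits only for
       INT32_MIN * INT32_MIN; the saturated upper word is INT32_MAX. */
    if (a == INT32_MIN && b == INT32_MIN)
        return INT32_MAX;

    int64_t p = (int64_t)a * (int64_t)b;

    /* Upper 32 bits of (p << 1) equal floor(p / 2^31). The ~ form
       avoids right-shifting a negative value, which is
       implementation-defined in C. */
    int64_t q = (p < 0) ? ~(~p >> 31) : (p >> 31);
    return (int32_t)q;
}

/* Usage example: spot checks with hand-computed values. */
void multiply_models_demo(void)
{
    assert(model_mpy32ll(-2, 3) == -6);
    assert(model_mpy32su(-1, 0xFFFFFFFFu) == -4294967295LL);
    assert(model_mpy32u(0xFFFFFFFFu, 2u) == 0x1FFFFFFFEull);
    /* If the operands are read as Q31 fractions: 0.5 * 0.5 = 0.25. */
    assert(model_smpy32(0x40000000, 0x40000000) == 0x20000000);
}
```

Reading the _smpy32 operands as Q31 fractions is one interpretation, not something the table states, but it shows why the extra left shift is there: it removes the duplicated sign bit that a plain 32 by 32 fractional product would otherwise carry.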
The intrinsics listed in Table 2−9 are included only for C67x devices. The intrinsics shown correspond to the indicated C6000 assembly language instruction(s). See the TMS320C6000 CPU and Instruction Set Reference Guide for more information.
See Table 2−6 on page 2-15 for the listing of generic C6000 intrinsics. See Table 2−7 on page 2-19 for the listing of C64x/C64x+-specific intrinsics. See Table 2−8 on page 2-24 for the listing of C64x+-specific intrinsics.
Table 2−9. TMS320C67x C/C++ Compiler Intrinsics
C/C++ Compiler Intrinsic Assembly Instruction Description
int _dpint(double src); DPINT Converts 64-bit double to 32-bit signed integer, using the rounding mode set by the CSR register
double _fabs(double src);
float _fabsf(float src);
ABSDP