**Optimizing C/C++ Code**

| **C/C++ Compiler Intrinsic** | **Assembly Instruction** | **Description** |
|---|---|---|
| | **SSHVL** **SSHVR** | Shifts src2 to the left/right by src1 bits. Saturates the result if the shifted value is greater than MAX_INT or less than MIN_INT. |
| **int _sub4 (int src1, int src2);** | **SUB4** | Performs 2s-complement subtraction between pairs of packed 8-bit values. |
| **int _subabs4 (int src1, int src2);** | **SUBABS4** | Calculates the absolute value of the differences for each pair of packed 8-bit values. |
| **uint _swap4 (uint src);** | **SWAP4** | Exchanges pairs of bytes (an endian swap) within each 16-bit value. |
| **uint _unpkhu4 (uint src);** | **UNPKHU4** | Unpacks the two high unsigned 8-bit values into unsigned packed 16-bit values. |
| **uint _unpklu4 (uint src);** | **UNPKLU4** | Unpacks the two low unsigned 8-bit values into unsigned packed 16-bit values. |
| **uint _xpnd2 (uint src);** | **XPND2** | Bits 1 and 0 of src are replicated to the upper and lower halfwords of the result, respectively. |
| **uint _xpnd4 (uint src);** | **XPND4** | Bits 3 through 0 of src are replicated to bytes 3 through 0 of the result. |

† See section 2.4.2, Wider Memory Access for Smaller Data Widths, on page 2-28 for more information.

‡ See the *TMS320C6000 Optimizing Compiler User’s Guide* for details on manipulating 8-byte data quantities.
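The packed 8-bit rows above can be checked off-target with a portable model. The sketch below is not TI code: the `ref_` names are hypothetical, and it assumes the byte lanes of SUBABS4 are unsigned, which the table does not state. It only mirrors the table's wording so that intrinsic-based code can be compared against plain C on a host machine.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical host-side models of a few packed 8-bit intrinsics.
 * They follow the table descriptions; they are not the TI implementation. */

/* SUBABS4: absolute difference per byte lane (lanes assumed unsigned). */
uint32_t ref_subabs4(uint32_t src1, uint32_t src2)
{
    uint32_t result = 0;
    for (int lane = 0; lane < 4; ++lane) {
        uint32_t a = (src1 >> (8 * lane)) & 0xFFu;
        uint32_t b = (src2 >> (8 * lane)) & 0xFFu;
        uint32_t diff = (a > b) ? (a - b) : (b - a);
        result |= diff << (8 * lane);
    }
    return result;
}

/* SWAP4: exchange the two bytes inside each 16-bit half. */
uint32_t ref_swap4(uint32_t src)
{
    return ((src & 0x00FF00FFu) << 8) | ((src >> 8) & 0x00FF00FFu);
}

/* UNPKHU4: bytes 3 and 2 become the upper and lower 16-bit values. */
uint32_t ref_unpkhu4(uint32_t src)
{
    return ((src >> 8) & 0x00FF0000u) | ((src >> 16) & 0x000000FFu);
}

/* UNPKLU4: bytes 1 and 0 become the upper and lower 16-bit values. */
uint32_t ref_unpklu4(uint32_t src)
{
    return ((src << 8) & 0x00FF0000u) | (src & 0x000000FFu);
}

/* XPND4: each of bits 3..0 is expanded to a full byte of the result. */
uint32_t ref_xpnd4(uint32_t src)
{
    uint32_t result = 0;
    for (int bit = 0; bit < 4; ++bit) {
        if ((src >> bit) & 1u) {
            result |= (uint32_t)0xFFu << (8 * bit);
        }
    }
    return result;
}

/* Usage example with hand-checked values. */
void packed_bytes_example(void)
{
    assert(ref_swap4(0x11223344u) == 0x22114433u);
    assert(ref_unpkhu4(0x11223344u) == 0x00110022u);
    assert(ref_unpklu4(0x11223344u) == 0x00330044u);
    assert(ref_xpnd4(0x5u) == 0x00FF00FFu);
}
```

A model like this is mainly useful in unit tests, where the same input vectors can be run through the real intrinsics on the target and through the reference functions on the host.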

The intrinsics listed in Table 2−8 are included only for C64x+ devices. The intrinsics shown correspond to the indicated C6000 assembly language instruction(s). See the *TMS320C6000 CPU and Instruction Set Reference Guide* for more information.

See Table 2−6 on page 2-15 for the listing of generic C6000 intrinsics. See Table 2−7 on page 2-19 for the listing of C64x/C64x+-specific intrinsics. See Table 2−9 on page 2-27 for the listing of C67x-specific intrinsics.

*Table 2−8. TMS320C64x+ C/C++ Compiler Intrinsics*

| **C/C++ Compiler Intrinsic** | **Assembly Instruction** | **Description** |
|---|---|---|
| **long long _addsub(uint src1, uint src2);** | **ADDSUB** | Calculates the addition and subtraction on common inputs in parallel. |
| **long long _addsub2(uint src1, uint src2);** | **ADDSUB2** | Calculates the 16-bit addition and subtraction on common inputs in parallel. |
| **long long _cmpy(uint src1, uint src2);** | **CMPY** | Calculates the complex multiply for the pair of 16-bit complex values. |
| **uint _cmpyr(uint src1, uint src2);** | **CMPYR** | Calculates the complex multiply for the pair of 16-bit complex values with rounding. |
| **uint _cmpyr1(uint src1, uint src2);** | **CMPYR1** | Calculates the complex multiply for the pair of 16-bit complex values with rounding. |
| **long long _ddotp4(uint src1, uint src2);** | **DDOTP4** | The product of the lower byte of the lower half-word of src2 and the lower half-word of src1 is added to the product of the upper byte of the lower half-word of src2 and the upper half-word of src1. The result is placed in the lower destination register. The product of the lower byte of the upper half-word of src2 and the lower half-word of src1 is added to the product of the upper byte of the upper half-word of src2 and the upper half-word of src1. The result is placed in the upper destination register. |
| **long long _ddotph2(long long src1_o:src1_e, uint src2);** | **DDOTPH2** | The product of the lower half-words of src1_o and src2 is added to the product of the upper half-words of src1_o and src2. The result is placed in the upper destination register. The product of the lower half-word of src1_o and the upper half-word of src2 is added to the product of the upper half-word of src1_e and the lower half-word of src2. The result is placed in the lower destination register. |
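The DDOTP4 and DDOTPH2 descriptions are dense, so a literal transcription into portable C may help when reading them. The sketch below is a host-side model under stated assumptions, not TI's implementation: the `ref_` names and the result struct are hypothetical, all 8-bit and 16-bit operands are assumed to be signed (the table does not say), and sums are kept in 64-bit fields so that the one overflow corner case of a 32-bit destination register is simply not modeled.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical result type: the two halves of the destination pair.
 * Exact sums are kept; the hardware registers hold 32 bits each. */
typedef struct {
    int64_t upper;  /* value described as going to the upper register */
    int64_t lower;  /* value described as going to the lower register */
} ref_pair_result;

/* Sign-extend the low 16 bits of v (operands assumed signed). */
static int32_t sext16(uint32_t v)
{
    int32_t x = (int32_t)(v & 0xFFFFu);
    return (x >= 0x8000) ? (x - 0x10000) : x;
}

/* Sign-extend the low 8 bits of v (operands assumed signed). */
static int32_t sext8(uint32_t v)
{
    int32_t x = (int32_t)(v & 0xFFu);
    return (x >= 0x80) ? (x - 0x100) : x;
}

/* DDOTP4 as worded in the table: bytes of src2 times half-words of src1. */
ref_pair_result ref_ddotp4(uint32_t src1, uint32_t src2)
{
    int64_t h_lo = sext16(src1);
    int64_t h_hi = sext16(src1 >> 16);
    ref_pair_result r;
    r.lower = sext8(src2)       * h_lo + sext8(src2 >> 8)  * h_hi;
    r.upper = sext8(src2 >> 16) * h_lo + sext8(src2 >> 24) * h_hi;
    return r;
}

/* DDOTPH2 as worded in the table. */
ref_pair_result ref_ddotph2(uint32_t src1_o, uint32_t src1_e, uint32_t src2)
{
    int64_t o_lo = sext16(src1_o), o_hi = sext16(src1_o >> 16);
    int64_t e_hi = sext16(src1_e >> 16);
    int64_t s_lo = sext16(src2),   s_hi = sext16(src2 >> 16);
    ref_pair_result r;
    r.upper = o_lo * s_lo + o_hi * s_hi;
    r.lower = o_lo * s_hi + e_hi * s_lo;
    return r;
}

/* Usage example with hand-checked values. */
void ddotp_example(void)
{
    /* src1 half-words: 3, -2; src2 bytes: 4, 2, -1, 5 */
    ref_pair_result a = ref_ddotp4(0x0003FFFEu, 0x0402FF05u);
    assert(a.lower == -13 && a.upper == 8);

    /* src1_o = (4, 3), src1_e = (2, 1), src2 = (10, 100) */
    ref_pair_result b = ref_ddotph2(0x00040003u, 0x00020001u, 0x000A0064u);
    assert(b.upper == 340 && b.lower == 230);
}
```

Read this way, DDOTPH2 computes two overlapping two-tap dot products over the upper three half-words of the 64-bit source, which is the access pattern of a sliding-window filter.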


*Table 2−8. TMS320C64x+ C/C++ Compiler Intrinsics (Continued)*

| **C/C++ Compiler Intrinsic** | **Assembly Instruction** | **Description** |
|---|---|---|
| **uint _ddotph2r(long long src1_o:src1_e, uint src2);** | **DDOTPH2R** | The product of the lower half-words of src1_o and src2 is added to the product of the upper half-words of src1_o and src2. The result is rounded and placed in the upper destination register. The product of the lower half-word of src1_o and the upper half-word of src2 is added to the product of the upper half-word of src1_e and the lower half-word of src2. The result is rounded and placed in the lower destination register. |
| **long long _ddotpl2(long long src1_o:src1_e, uint src2);** | **DDOTPL2** | The product of the lower half-words of src1_e and src2 is added to the product of the upper half-words of src1_e and src2. The result is placed in the lower destination register. The product of the lower half-word of src1_e and the upper half-word of src2 is added to the product of the upper half-word of src1_o and the lower half-word of src2. The result is placed in the upper destination register. |
| **uint _ddotpl2r(long long src1_o:src1_e, uint src2);** | **DDOTPL2R** | The product of the lower half-words of src1_e and src2 is added to the product of the upper half-words of src1_e and src2. The result is rounded and placed in the lower destination register. The product of the lower half-word of src1_e and the upper half-word of src2 is added to the product of the upper half-word of src1_o and the lower half-word of src2. The result is rounded and placed in the upper destination register. |
| **long long _dmv(uint src1, uint src2);** | **DMV** | The two independent registers are moved to a register pair. |
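Several rows refer to a `long long` operand as src1_o:src1_e and to upper and lower destination registers. A minimal host-side sketch of that layout follows, assuming the odd register of a pair holds the upper 32 bits and the even register the lower 32 bits. The `ref_` helper names are hypothetical and are not part of the TI compiler. The table does not say which DMV source lands in which half, so the sketch takes the halves explicitly rather than guessing an operand order.

```c
#include <assert.h>
#include <stdint.h>

/* Build a 64-bit pair from an odd (upper) and even (lower) 32-bit value.
 * Assumption: odd register = bits 63..32, even register = bits 31..0. */
uint64_t ref_make_pair(uint32_t odd, uint32_t even)
{
    return ((uint64_t)odd << 32) | (uint64_t)even;
}

/* Upper 32 bits of the pair (the "_o" half). */
uint32_t ref_pair_odd(uint64_t pair)
{
    return (uint32_t)(pair >> 32);
}

/* Lower 32 bits of the pair (the "_e" half). */
uint32_t ref_pair_even(uint64_t pair)
{
    return (uint32_t)(pair & 0xFFFFFFFFu);
}

/* Signed 16-bit half-word number index (0 = least significant, 3 = most). */
int32_t ref_pair_halfword(uint64_t pair, int index)
{
    assert(index >= 0 && index < 4);
    int32_t x = (int32_t)((pair >> (16 * index)) & 0xFFFFu);
    return (x >= 0x8000) ? (x - 0x10000) : x;
}

/* Usage example with hand-checked values. */
void pair_example(void)
{
    uint64_t pair = ref_make_pair(0x00040003u, 0x00020001u);
    assert(pair == 0x0004000300020001ull);
    assert(ref_pair_odd(pair) == 0x00040003u);
    assert(ref_pair_even(pair) == 0x00020001u);
    assert(ref_pair_halfword(pair, 2) == 3);
}
```

With this picture, "the upper half-word of src1_e" is half-word 1 of the pair and "the lower half-word of src1_o" is half-word 2.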


*Table 2−8. TMS320C64x+ C/C++ Compiler Intrinsics (Continued)*

| **C/C++ Compiler Intrinsic** | **Assembly Instruction** | **Description** |
|---|---|---|
| **int _dotpnrsu2(int src1, uint src2);** | **DOTPNRSU2** | The product of the lower unsigned 16-bit values in src1 and src2 is subtracted from the product of the signed upper 16-bit values of src1 and src2. 2^15 is added and the result is sign shifted right by 16. The intermediate results are maintained to 33-bit precision. |
| **int _dotprsu2(int src1, uint src2);** | **DOTPRSU2** | The product of the first signed pair of 16-bit values is added to the product of the unsigned second pair of 16-bit values. 2^15 is added and the result is sign shifted right by 16. The intermediate results are maintained to 33-bit precision. |
| **long long _dpack2(uint src1, uint src2);** | **DPACK2** | Performs PACK2 and PACKH2 operations in parallel on common inputs. |
| **long long _dpackx2(uint src1, uint src2);** | **DPACKX2** | Performs two PACKLH2 operations in parallel on common inputs. |
| **uint _gmpy(uint src1, uint src2);** | **GMPY** | Performs Galois Field Multiply. |
| **long long _mpy2ir(uint src1, uint src2);** | **MPY2IR** | Performs two 16 by 32 multiplies. The product of the upper half-word of src1 and src2 is rounded, shifted and then placed in the upper destination register. The product of the lower half-word of src1 and src2 is rounded, shifted and then placed in the lower destination register. |
| **int _mpy32(int src1, int src2);** | **MPY32** | Produces a 32 by 32 multiply with a 32-bit result. |
| **long long _mpy32ll(int src1, int src2);** **long long _mpy32su(int src1, uint src2);** **long long _mpy32u(uint src1, uint src2);** **long long _mpy32us(uint src1, int src2);** | **MPY32** **MPY32SU** **MPY32U** **MPY32US** | Produces a 32 by 32 multiply with a 64-bit result. The inputs and outputs can be signed or unsigned. |
| **uint _rpack2 (uint src1, uint src2);** | **RPACK2** | The src1 and src2 inputs are shifted left by 1 with saturation. The upper half-words of the shifted inputs are placed in the return value. |
| **long long _saddsub(uint src1, uint src2);** | **SADDSUB** | Calculates the addition and subtraction with saturation on common inputs in parallel. |
| **long long _saddsub2(uint src1, uint src2);** | **SADDSUB2** | Calculates the 16-bit addition and subtraction with saturation on common inputs in parallel. |
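The rounding step shared by DOTPRSU2 and DOTPNRSU2 (add 2^15, then shift right by 16 with sign) can be sketched in portable C. This is a host-side model with hypothetical `ref_` names, not TI's implementation. It assumes the half-words of src1 are signed and those of src2 are unsigned, which is what the prototypes (`int src1, uint src2`) suggest but the table wording leaves unclear. A 64-bit accumulator stands in for the 33-bit intermediate precision.

```c
#include <assert.h>
#include <stdint.h>

/* Sign-extend the low 16 bits of v. */
static int64_t sext16(uint32_t v)
{
    int64_t x = (int64_t)(v & 0xFFFFu);
    return (x >= 0x8000) ? (x - 0x10000) : x;
}

/* Zero-extend the low 16 bits of v. */
static int64_t zext16(uint32_t v)
{
    return (int64_t)(v & 0xFFFFu);
}

/* Add 2^15 and shift right by 16 with sign, written as a floor division
 * so it does not depend on how the host shifts negative values. */
static int32_t round_shift16(int64_t sum)
{
    int64_t v = sum + 0x8000;
    int64_t q = (v >= 0) ? (v >> 16) : -((-v + 0xFFFF) >> 16);
    return (int32_t)q;
}

/* DOTPRSU2 model: signed src1 halves times unsigned src2 halves, summed. */
int32_t ref_dotprsu2(uint32_t src1, uint32_t src2)
{
    int64_t hi = sext16(src1 >> 16) * zext16(src2 >> 16);
    int64_t lo = sext16(src1) * zext16(src2);
    return round_shift16(hi + lo);
}

/* DOTPNRSU2 model: lower product subtracted from upper product. */
int32_t ref_dotpnrsu2(uint32_t src1, uint32_t src2)
{
    int64_t hi = sext16(src1 >> 16) * zext16(src2 >> 16);
    int64_t lo = sext16(src1) * zext16(src2);
    return round_shift16(hi - lo);
}

/* Usage example with hand-checked values. */
void dotp_round_example(void)
{
    /* 16384*4 + 8192*8 = 131072; +32768 = 163840; / 65536 -> 2 */
    assert(ref_dotprsu2(0x40002000u, 0x00040008u) == 2);
    /* -16384*6 = -98304; +32768 = -65536; / 65536 -> -1 */
    assert(ref_dotprsu2(0xC0000000u, 0x00060000u) == -1);
    /* 16384*8 - 8192*2 = 114688; +32768 = 147456; / 65536 -> 2 */
    assert(ref_dotpnrsu2(0x40002000u, 0x00080002u) == 2);
}
```

Adding 2^15 before the shift rounds to nearest, with exact halves rounded toward positive infinity, as the second case (an exact -1.5 becoming -1) illustrates.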


*Table 2−8. TMS320C64x+ C/C++ Compiler Intrinsics (Continued)*

| **C/C++ Compiler Intrinsic** | **Assembly Instruction** | **Description** |
|---|---|---|
| **long long _shfl3 (uint src1, uint src2);** | **SHFL3** | Performs 3-way bit interleave for 3 16-bit values to produce a 48-bit result. |
| **int _smpy32(int src1, int src2);** | **SMPY32** | Produces a 32 by 32 multiply with a 32-bit result by shifting the intermediate 64-bit result left by 1 with saturation and then placing the upper 32 bits of the shifted result in the destination register. |
| **double _smpy2 (int src1, int src2);** | **SMPY2** | Performs 16-bit multiplication between pairs of signed packed 16-bit values, with an additional 1-bit left shift and saturate into a double result. |
| **uint _ssub2 (uint src1, uint src2);** | **SSUB2** | Performs 16-bit subtraction with saturation. |
| **uint _xormpy (uint src1, uint src2);** | **XORMPY** | Performs Galois field multiply with a zero-value polynomial. |
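The SMPY32 and SSUB2 rows both involve saturation, which is easy to get wrong when porting code away from the intrinsics. The following is a minimal host-side sketch, with hypothetical `ref_` names and not TI's implementation. It assumes SSUB2 treats each 16-bit half as a signed value, which the table does not state.

```c
#include <assert.h>
#include <stdint.h>

/* SMPY32 model: 64-bit product, shifted left by 1 with saturation,
 * upper 32 bits returned. The left shift can only overflow when both
 * inputs are INT32_MIN, so that case is handled first. */
int32_t ref_smpy32(int32_t src1, int32_t src2)
{
    if (src1 == INT32_MIN && src2 == INT32_MIN) {
        return INT32_MAX;
    }
    int64_t product = (int64_t)src1 * (int64_t)src2;
    /* Upper 32 bits of (product << 1) equal floor(product / 2^31). */
    int64_t bias = ((int64_t)1 << 31) - 1;
    int64_t q = (product >= 0) ? (product >> 31)
                               : -((-product + bias) >> 31);
    return (int32_t)q;
}

/* Sign-extend the low 16 bits of v. */
static int32_t sext16(uint32_t v)
{
    int32_t x = (int32_t)(v & 0xFFFFu);
    return (x >= 0x8000) ? (x - 0x10000) : x;
}

/* SSUB2 model: src1 - src2 in each 16-bit half, clamped to the signed
 * 16-bit range (signedness is an assumption). */
uint32_t ref_ssub2(uint32_t src1, uint32_t src2)
{
    uint32_t result = 0;
    for (int lane = 0; lane < 2; ++lane) {
        int32_t d = sext16(src1 >> (16 * lane)) - sext16(src2 >> (16 * lane));
        if (d > 32767) d = 32767;
        if (d < -32768) d = -32768;
        result |= ((uint32_t)d & 0xFFFFu) << (16 * lane);
    }
    return result;
}

/* Usage example with hand-checked values. */
void saturation_example(void)
{
    /* Q31: 0.5 * 0.5 = 0.25 */
    assert(ref_smpy32(0x40000000, 0x40000000) == 0x20000000);
    /* Q31: -1.0 * -1.0 saturates just below +1.0 */
    assert(ref_smpy32(INT32_MIN, INT32_MIN) == INT32_MAX);
    /* Both halves saturate: 32767 - (-1) and -32768 - 1 */
    assert(ref_ssub2(0x7FFF8000u, 0xFFFF0001u) == 0x7FFF8000u);
}
```

Interpreting the operands as Q31 fractions, SMPY32 behaves as a fractional multiply, and the single saturating case is the product of -1.0 with itself.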


The intrinsics listed in Table 2−9 are included only for C67x devices. The intrinsics shown correspond to the indicated C6000 assembly language instruction(s). See the *TMS320C6000 CPU and Instruction Set Reference Guide* for more information.

See Table 2−6 on page 2-15 for the listing of generic C6000 intrinsics. See Table 2−7 on page 2-19 for the listing of C64x/C64x+-specific intrinsics. See Table 2−8 on page 2-24 for the listing of C64x+-specific intrinsics.

*Table 2−9. TMS320C67x C/C++ Compiler Intrinsics*

**C/C++ Compiler Intrinsic** **Assembly Instruction** **Description**

**int _dpint(double src);** **DPINT** Converts 64-bit double to 32-bit signed integer, using the rounding mode set by the CSR register.
**double _fabs(double src);**

**float _fabsf(float src);**

**ABSDP**