
X86 SIMD instruction listings


The x86 instruction set has several times been extended with SIMD (Single instruction, multiple data) instruction set extensions. These extensions, starting from the MMX instruction set extension introduced with Pentium MMX in 1997, typically define sets of wide registers and instructions that subdivide these registers into fixed-size lanes and perform a computation for each lane in parallel.
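For example, the MMX instruction paddw mm0,mm1 treats each 64-bit register as four 16-bit lanes and performs four independent 16-bit additions in parallel; the later extensions apply the same model to progressively wider registers.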


Summary of SIMD extensions


The main SIMD instruction set extensions that have been introduced for x86 are:

[Table omitted - columns include: SIMD instruction set extension, Year, ...]
  1. The count of 13 instructions for SSE3 includes the non-SIMD instructions MONITOR and MWAIT that were also introduced as part of "Prescott New Instructions" - these two instructions are considered to be SSE3 instructions by Intel but not by AMD.
  2. On older Zhaoxin processors, such as KX-6000 "LuJiaZui", AVX2 instructions are present but not exposed through CPUID due to the lack of FMA3 support.[1]
  3. Early drafts of the AVX10 specification also added an option for implementations to limit the maximum supported vector-register width to 128/256 bits[2] - however, as of March 2025, this option has been removed, making support for 512-bit vector-register width mandatory again.[3][4]

MMX instructions and extended variants thereof


These instructions are, unless otherwise noted, available in the following forms:

  • MMX: 64-bit vectors, operating on mm0..mm7 registers (aliased on top of the old x87 register file)
  • SSE2: 128-bit vectors, operating on xmm0..xmm15 registers (xmm0..xmm7 in 32-bit mode)
  • AVX: 128-bit vectors, operating on xmm0..xmm15 registers, with a new three-operand encoding enabled by the new VEX prefix. (AVX introduced 256-bit vector registers, but the full width of these vectors was in general not made available for integer SIMD instructions until AVX2.)
  • AVX2: 256-bit vectors, operating on ymm0..ymm15 registers (extended versions of the xmm0..xmm15 registers)
  • AVX-512: 512-bit vectors, operating on zmm0..zmm31 registers (zmm0..zmm15 are extended versions of the ymm0..ymm15 registers, while zmm16..zmm31 are new to AVX-512). AVX-512 also introduces opmasks, allowing the operation of most instructions to be masked on a per-lane basis by an opmask register (the lane width varies from one instruction to another). AVX-512 also adds broadcast functionality for many of its instructions - this is used with memory source arguments to replicate a single value to all lanes of a vector calculation. The tables below provide indications of whether opmasks and broadcasts are supported for each instruction, and if so, what lane-widths they are using.

For many of the instruction mnemonics, (V) is used to indicate that the mnemonic exists in forms with and without a leading V - the form with the leading V is used for the VEX/EVEX-prefixed instruction variants introduced by AVX/AVX2/AVX-512, while the form without the leading V is used for legacy MMX/SSE encodings without a VEX/EVEX prefix.
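For example, for the packed 16-bit add instruction (a brief illustration of the encoding forms described above):

  • paddw xmm1,xmm2 (legacy SSE2 encoding) will perform xmm1 ← xmm1 + xmm2 per 16-bit lane, leaving any register bits above bit 127 unchanged
  • vpaddw xmm1,xmm2,xmm3 (VEX-encoded AVX form) will perform xmm1 ← xmm2 + xmm3 in three-operand form, zeroing register bits above bit 127
  • vpaddw zmm1{k1},zmm2,zmm3 (EVEX-encoded AVX-512 form) additionally supports per-lane opmasking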

Original Pentium MMX instructions, and SSE2/AVX/AVX-512 extended variants thereof

[Table omitted - columns include: Description, Instruction mnemonics, ...]
  1. EMMS will also set the x87 top-of-stack to 0.
    Unlike the older FNINIT instruction, EMMS will not update the FPU Control Word, nor will it update any part of the FPU Status Register other than the top-of-stack. If there are any unmasked pending x87 exceptions, EMMS will raise the exception while FNINIT will clear it.
  2. The 0F 77 opcode can be VEX-encoded (resulting in the AVX VZEROUPPER and VZEROALL instructions), but this requires a VEX.NP prefix, not a VEX.66 prefix.
  3. The 64-bit move instruction forms that are encoded by using a REX.W prefix with the 0F 6E and 0F 7E opcodes are listed with different mnemonics in Intel and AMD documentation — MOVQ in Intel documentation[5] and MOVD in AMD documentation.[6]
    This is a documentation difference only — the operation performed by these opcodes is the same for Intel and AMD.
    This documentation difference applies only to the MMX/SSE forms of these opcodes — for VEX/EVEX-encoded forms, both Intel and AMD use the mnemonic VMOVQ.
  4. The REX.W-encoded variants of MOVQ are available in 64-bit "long mode" only. For SSE2 and later, MOVQ to and from xmm/ymm/zmm registers can also be encoded with F3 0F 7E /r and 66 0F D6 /r respectively - these encodings are shorter and available outside 64-bit mode.
  5. On all Intel,[7] AMD[8] and Zhaoxin[9] processors that support AVX, the 128-bit forms of VMOVDQA (encoded with a VEX prefix and VEX.L=0) are, when used with a memory argument addressing WB (write-back cacheable) memory, architecturally guaranteed to perform the 128-bit memory access atomically - this applies to both load and store.

    (Intel and AMD provide somewhat wider guarantees covering more 128-bit instruction variants, but Zhaoxin provides the guarantee for cacheable VMOVDQA only.)

    While 128-bit VMOVDQA is atomic, it is not locked — it can be reordered in the same way as normal x86 loads/stores (e.g. loads passing older stores).

    On processors that support SSE but don't support AVX, the 128-bit forms of SSE load/store instructions such as MOVAPS/MOVAPD/MOVDQA are not guaranteed to execute atomically — examples of processors where such instructions have been observed to execute non-atomically include Intel Core Duo and AMD K10.[10]

  6. VMOVDQA is available with a vector length of 256 bits under AVX, not requiring AVX2.

    Unlike the 128-bit form, the 256-bit form of VMOVDQA does not provide any special atomicity guarantees.

  7. For the VPACK* and VPUNPCK* instructions, encodings with a vector-length wider than 128 bits are available under AVX2 and AVX-512, but the operation of such encodings is split into 128-bit lanes where each 128-bit lane internally performs the same operation as the 128-bit variant of the instruction.
  8. For the memory argument forms of (V)PUNPCKL* instructions, the memory argument is half-width only for the MMX variants of the instructions. For SSE/AVX/AVX-512 variants, the width of the memory argument is the full vector width even though only half of it is actually used.
  9. The EVEX-encoded variants of the VPCMPEQ* and VPCMPGT* instructions write their results to AVX-512 opmask registers. This differs from the older non-EVEX variants, which write comparison results as vectors of all-0s/all-1s values to the regular mm/xmm/ymm vector registers.
  10. The (V)PMADDWD instruction will add multiplication results pairwise, but will not add the sum to an accumulator. AVX512_VNNI provides the instructions VPDPWSSD and VPDPWSSDS, which will add multiplication results pairwise, and then also add them to a per-32-bit-lane accumulator.
  11. For the MMX packed shift instructions PSLL*, PSRL* and PSRA* with a shift-argument taken from a vector source (mm or m64), the shift-amount is considered to be a single 64-bit scalar value - the same shift-amount is used for all lanes of the destination vector. This shift-amount is unsigned and is not masked - all bits are considered (e.g. a shift-amount of 0x80000000_00000000 can be specified and will have the same effect as a shift-amount of 64).

    For all SSE2/AVX/AVX512 extended variants of these instructions, the shift-amount vector argument is considered to be a 128-bit (xmm or m128) argument - the bottom 64 bits are used as the shift-amount.

    Packed shift-instructions that can take a variable per-lane shift-amount were introduced in AVX2 for 32/64-bit lanes and AVX512BW for 16-bit lanes (VPSLLV*, VPSRLV*, VPSRAV* instructions).
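As an illustration of the different shift-amount forms (Intel-style assembly syntax):

  • psllw mm0,mm1 (MMX) will shift all four 16-bit lanes of mm0 left by the single scalar shift-amount in mm1
  • vpsllw xmm1,xmm2,xmm3 (AVX) will shift all lanes of xmm2 by the scalar shift-amount in bits 63:0 of xmm3
  • vpsllvd ymm1,ymm2,ymm3 (AVX2) will shift each 32-bit lane of ymm2 by the shift-amount in the corresponding lane of ymm3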

MMX instructions added with MMX+/SSE/SSE2/SSSE3, and SSE2/AVX/AVX-512 extended variants thereof

[Table omitted - columns include: Description, Instruction mnemonics, ...]
  1. For shuffle of four 16-bit integers in a 64-bit section of a 128-bit XMM register, the SSE2 instructions PSHUFLW (opcode F2 0F 70 /r) or PSHUFHW (opcode F3 0F 70 /r) may be used.
  2. For the VPSHUFD, VPSHUFB, VPHADD*, VPHSUB* and VPALIGNR instructions, encodings with a vector-length wider than 128 bits are available under AVX2 and/or AVX-512, but the operation of such encodings is split into 128-bit lanes where each 128-bit lane internally performs the same operation as the 128-bit variant of the instruction.
  3. For the VEX-encoded forms of the VPINSRW and VPEXTRW instructions, the Intel SDM (as of rev 084) indicates that the instructions must be encoded with VEX.W=0; however, neither Intel XED nor the AMD APM indicates any such requirement.
  4. The 0F C5 /r ib variant of PEXTRW allows register destination only. For SSE4.1 and later, a variant that allows a memory destination is available with the opcode 66 0F 3A 15 /r ib.
  5. EVEX-prefixed opcode not available. Under AVX-512, a bitmask made from the top bit of each byte can instead be constructed with the VPMOVB2M instruction, with opcode EVEX.F3.0F38.W0 29 /r, which will store such a bitmask to an opmask register.
  6. VMOVNTDQ is available with a vector length of 256 bits under AVX, not requiring AVX2.
  7. For the MASKMOVQ and (V)MASKMOVDQU instructions, exception and trap behavior for disabled lanes is implementation-dependent. For example, a given implementation may signal a data breakpoint or a page fault for bytes that are zero-masked and not actually written.
  8. For AVX, masked stores to memory are also available using the VMASKMOVPS instruction with opcode VEX.66.0F38 2E /r - unlike VMASKMOVDQU, this instruction allows 256-bit stores without temporal hints, although its mask is coarser - 4 bytes vs 1 byte per lane.
  9. Opcode not available under AVX-512. Under AVX-512, unaligned masked stores to memory (albeit without temporal hints) can be done with the VMOVDQU(8|16|32|64) instructions with opcode EVEX.F2/F3.0F 7F /r, using an opmask register to provide a write mask.
  10. For AVX2 and AVX-512 with vectors wider than 128 bits, the VPSHUFB instruction is restricted to byte-shuffle within each 128-bit lane. Instructions that can do shuffles across 128-bit lanes include e.g. AVX2's VPERMD (shuffle of 32-bit lanes across 256-bit YMM register) and AVX512_VBMI's VPERMB (full byte shuffle across 64-byte ZMM register).
  11. For AVX-512, VPALIGNR is supported but will perform its operation within each 128-bit lane. For packed alignment shifts that can shift data across 128-bit lanes, AVX512F's VALIGND instruction may be used, although its shift-amount is specified in units of 32 bits rather than bytes.

SSE instructions and extended variants thereof


Regularly-encoded floating-point SSE/SSE2 instructions, and AVX/AVX-512 extended variants thereof

For the instructions in the table below, the following considerations apply unless otherwise noted:

  • Packed instructions are available at all vector lengths (128-bit for SSE2, 128/256-bit for AVX, 128/256/512-bit for AVX-512)
  • FP32 variants of instructions are introduced as part of SSE. FP64 variants of instructions are introduced as part of SSE2.
  • The AVX-512 variants of the FP32 and FP64 instructions are introduced as part of the AVX512F subset.
  • For AVX-512 variants of the instructions, opmasks and broadcasts are available with a width of 32 bits for FP32 operations and 64 bits for FP64 operations. (Broadcasts are available for vector operations only.)

From SSE2 onwards, some data movement/bitwise instructions exist in three forms: an integer form, an FP32 form and an FP64 form. Such instructions are functionally identical, however some processors with SSE2 implement the integer, FP32 and FP64 execution units as three different execution clusters, where forwarding of results from one cluster to another may come with performance penalties - such penalties can be minimized by choosing instruction forms appropriately. (For example, there exist three forms of vector bitwise XOR instructions under SSE2 - PXOR, XORPS, and XORPD - intended for use on integer, FP32, and FP64 data, respectively.)
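For illustration, the three XOR forms assemble to different opcodes but compute the same bitwise result:

  • pxor xmm0,xmm1 - integer form (opcode 66 0F EF /r)
  • xorps xmm0,xmm1 - FP32 form (opcode 0F 57 /r)
  • xorpd xmm0,xmm1 - FP64 form (opcode 66 0F 57 /r)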

[Table omitted - columns include: Instruction description, Basic opcode, ...]
  1. The VEX-prefix-encoded variants of the scalar instructions listed in this table should be encoded with VEX.L=0. Setting VEX.L=1 for any of these instructions is allowed but will result in what the Intel SDM describes as "unpredictable behavior across different processor generations". This also applies to VEX-encoded variants of V(U)COMISS and V(U)COMISD. (This behavior does not apply to scalar instructions outside this table, such as VMOVD/VMOVQ, where VEX.L=1 results in an #UD exception.)
  2. EVEX-encoded variants of VMOVAPS, VMOVUPS, VMOVAPD and VMOVUPD support opmasks but do not support broadcast.
  3. The SSE2 MOVSD (MOVe Scalar Double-precision) and CMPSD (CoMPare Scalar Double-precision) instructions have the same names as the older i386 MOVSD (MOVe String Doubleword) and CMPSD (CoMPare String Doubleword) instructions, however their operations are completely unrelated.

    At the assembly language level, they can be distinguished by their use of XMM register operands.

  4. For variants of VMOVLPS, VMOVHPS, VMOVLPD, VMOVHPD, VMOVLHPS, VMOVHLPS encoded with VEX or EVEX prefixes, the only supported vector length is 128 bits (VEX.L=0 or EVEX.L=0).

    For the EVEX-encoded variants, broadcasts and opmasks are not supported.

  5. The MOVSLDUP, MOVSHDUP and MOVDDUP instructions are not regularly-encoded scalar SSE1/2 instructions, but instead irregularly-assigned SSE3 vector instructions. For a description of these instructions, see table below.
  6. For the VUNPCK*, VSHUFPS and VSHUFPD instructions, encodings with a vector-length wider than 128 bits are available under AVX2 and AVX-512, but the operation of such encodings is split into 128-bit lanes where each 128-bit lane internally performs the same operation as the 128-bit variant of the instruction (except that for VSHUFPD, each 128-bit lane will use a different 2-bit part of the instruction's imm8 argument).
  7. The CVTPI2PS and CVTPI2PD instructions take their input data as a vector of two 32-bit signed integers from either memory or MMX register. They will cause an x87→MMX transition even if the source operand is a memory operand.

    For vector int→FP conversions that can accept an xmm/ymm/zmm register or vectors wider than 64 bits as input arguments, SSE2 provides the following irregularly-assigned instructions (see table below):

    • CVTDQ2PS (0F 5B /r)
    • CVTDQ2PD (F3 0F E6 /r)
    These exist in AVX/AVX-512 extended forms as well.
  8. For the (V)CVTSI2SS and (V)CVTSI2SD instructions, variants with a 64-bit source argument are only available in 64-bit long mode and require REX.W, VEX.W or EVEX.W to be set to 1.

    In 32-bit mode, their source argument is always 32-bit even if VEX.W or EVEX.W is set to 1.

  9. EVEX-encoded variants of
    • VMOVNTPS, VMOVNTPD
    • VCOMISS, VCOMISD, VUCOMISS, VUCOMISD
    • VCVTSI2SS, VCVTSI2SD
    • VCVT(T)SS2SI, VCVT(T)SD2SI
    support neither opmasks nor broadcast.
  10. The CVT(T)PS2PI and CVT(T)PD2PI instructions write their result to MMX register as a vector of two 32-bit signed integers.

    For vector FP→int conversions that can write results to xmm/ymm/zmm registers, SSE2 provides the following irregularly-assigned instructions (see table below):

    • CVTPS2DQ (66 0F 5B /r)
    • CVTTPS2DQ (F3 0F 5B /r)
    • CVTPD2DQ (F2 0F E6 /r)
    • CVTTPD2DQ (66 0F E6 /r)
    These exist in AVX/AVX-512 extended forms as well.
  11. For the (V)CVT(T)SS2SI and (V)CVT(T)SD2SI instructions, variants with a 64-bit destination register are only available in 64-bit long mode and require REX.W, VEX.W or EVEX.W to be set to 1.

    In 32-bit mode, their destination register is always 32-bit even if VEX.W or EVEX.W is set to 1.

  12. This instruction cannot be EVEX-encoded. Under AVX512DQ, extracting packed floating-point sign-bits can instead be done with the VPMOVD2M and VPMOVQ2M instructions.
  13. The (V)RCPSS, (V)RCPPS, (V)RSQRTSS and (V)RSQRTPS approximation instructions compute their result with a relative error of at most 1.5 × 2^−12. The exact calculation is implementation-specific and known to vary between different x86 CPUs.[11]
  14. This instruction cannot be EVEX-encoded. Instead, AVX512F provides different opcodes - EVEX.66.0F38 4E/4F /r - for its new VRSQRT14* reciprocal square root approximation instructions.

    The main difference between the AVX-512 VRSQRT14* instructions and the older SSE/AVX (V)RSQRT* instructions is that the AVX-512 VRSQRT14* instructions have their operation defined in a bit-exact manner, with a C reference model provided by Intel.[12]

  15. This instruction cannot be EVEX-encoded. Instead, AVX512F provides different opcodes - EVEX.66.0F38 4C/4D /r - for its new VRCP14* reciprocal approximation instructions.

    The main difference between the AVX-512 VRCP14* instructions and the older SSE/AVX (V)RCP* instructions is that the AVX-512 VRCP14* instructions have their operation defined in a bit-exact manner, with a C reference model provided by Intel.[12]

  16. The EVEX-encoded versions of the VANDPS, VANDPD, VANDNPS, VANDNPD, VORPS, VORPD, VXORPS, VXORPD instructions are not introduced as part of the AVX512F subset, but instead the AVX512DQ subset.
  17. XORPS/VXORPS with both source operands being the same register is commonly used as a register-zeroing idiom, and is recognized by most x86 CPUs as an instruction that does not depend on its source arguments.
    Under AVX or AVX-512, it is recommended to use a 128-bit form of VXORPS for this purpose - this will, on some CPUs, result in fewer micro-ops than wider forms while still achieving register-zeroing of the whole 256 or 512 bit vector-register.[13]
  18. For EVEX-encoded variants of conversions between FP formats of different widths, the opmask lane width is determined by the result format: 64-bit for VCVTPS2PD and VCVTSS2SD, and 32-bit for VCVTPD2PS and VCVTSD2SS.
  19. Widening FP→FP conversions (CVTPS2PD, CVTSS2SD, VCVTPH2PD, VCVTSH2SD) support the SAE modifier. Narrowing conversions (CVTPD2PS, CVTSD2SS) support the RC modifier.
  20. For the floating-point minimum-value and maximum-value instructions (V)MIN* and (V)MAX*, if the two input operands are both zero or at least one of the input operands is NaN, then the second input operand is returned. This matches the behavior of common C programming-language expressions such as ((op1)>(op2)?(op1):(op2)) for maximum-value and ((op1)<(op2)?(op1):(op2)) for minimum-value.
  21. For the SIMD floating-point compares, the imm8 argument has the following format:
    [Table omitted - columns include: Bits, Usage, ...]
    The basic comparison predicates are:
    [Table omitted - columns include: Value, Meaning, ...]
    A signalling compare will cause an exception if any of the inputs are QNaN.

Integer SSE2/4 instructions with 66h prefix, and AVX/AVX-512 extended variants thereof

These instructions do not have any MMX forms, and do not support any encodings without a prefix. Most of these instructions have extended variants available in VEX-encoded and EVEX-encoded forms:

  • The VEX-encoded forms are available under AVX/AVX2. Under AVX, they are available only with a vector length of 128 bits (VEX.L=0 encoding) - under AVX2, they are (with some exceptions noted with "L=0") also made available with a vector length of 256 bits.
  • The EVEX-encoded forms are available under AVX-512 - the specific AVX-512 subset needed for each instruction is listed along with the instruction.
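For example, for the SSE4.1 packed signed-byte minimum instruction PMINSB (opcode 66 0F 38 38 /r):

  • pminsb xmm1,xmm2 - legacy SSE4.1 encoding, 128-bit only
  • vpminsb ymm1,ymm2,ymm3 - VEX-encoded 256-bit form, requiring AVX2
  • vpminsb zmm1{k1},zmm2,zmm3 - EVEX-encoded 512-bit form with opmasking, requiring AVX512BW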
[Table omitted - columns include: Description, Instruction mnemonics, ...]
  1. For the (V)PUNPCK*, (V)PACKUSDW, (V)PBLENDW, (V)PSLLDQ and (V)PSRLDQ instructions, encodings with a vector-length wider than 128 bits are available under AVX2 and/or AVX-512, but the operation of such encodings is split into 128-bit lanes where each 128-bit lane internally performs the same operation as the 128-bit variant of the instruction.
  2. Assemblers may accept PBLENDVB with or without XMM0 as a third argument.
  3. The PBLENDVB instruction with opcode 66 0F38 10 /r is not VEX-encodable. AVX does provide a VPBLENDVB instruction that is similar to PBLENDVB, however, it uses a different opcode and operand encoding - VEX.66.0F3A.W0 4C /r /is4.
  4. Opcode not EVEX-encodable. Under AVX-512, variable blend of packed bytes may be done with the VPBLENDMB instruction (opcode EVEX.66.0F38.W0 66 /r).
  5. The EVEX-encoded variants of the VPCMPEQ* and VPCMPGT* instructions write their results to AVX-512 opmask registers. This differs from the older non-EVEX variants, which write comparison results as vectors of all-0s/all-1s values to the regular mm/xmm/ymm vector registers.
  6. The load performed by (V)MOVNTDQA is weakly-ordered. It may be reordered with respect to other loads, stores and even LOCKs - to impose ordering with respect to other loads/stores, MFENCE or serialization is needed.

    If (V)MOVNTDQA is used with uncached memory, it may fetch a cache-line-sized block of data around the data actually requested - subsequent (V)MOVNTDQA instructions may return data from blocks fetched in this manner as long as they are not separated by an MFENCE or serialization.

  7. Under AVX, the VBLENDPS instruction (and, with AVX2, the VPBLENDD instruction) can be used to perform a blend with 32-bit lanes, allowing one imm8 mask to span a full 256-bit vector without repetition.
  8. Opcode not EVEX-encodable. Under AVX-512, variable blend of packed words may be done with the VPBLENDMW instruction (opcode EVEX.66.0F38.W1 66 /r).
  9. For (V)PEXTRB and (V)PEXTRW, if the destination argument is a register, then the extracted 8/16-bit value is zero-extended to 32/64 bits.
  10. For the VPEXTRD and VPINSRD instructions in non-64-bit mode, the instructions are documented as being permitted to be encoded with VEX.W=1 on Intel[14] but not AMD[15] CPUs (although exceptions to this do exist, e.g. Bulldozer permits such encodings[16] while Sandy Bridge does not[17])
    In 64-bit mode, these instructions require VEX.W=0 on both Intel and AMD processors — encodings with VEX.W=1 are interpreted as VPEXTRQ/VPINSRQ.
  11. In the case of a register source argument to (V)PINSRB, the argument is considered to be a 32-bit register of which the 8 bottom bits are used, not an 8-bit register proper. This means that it is not possible to specify AH/BH/CH/DH as a source argument to (V)PINSRB.
  12. EVEX-encoded variants of the VMPSADBW instruction are only available if AVX10.2 is supported.
  13. The SSE4.2 packed string compare PCMP*STR* instructions allow their 16-byte memory operands to be misaligned even when using legacy SSE encoding.

Other SSE/2/3/4 SIMD instructions, and AVX/AVX-512 extended variants thereof

SSE SIMD instructions that do not fit into any of the preceding groups. Many of these instructions have AVX/AVX-512 extended forms - unless otherwise indicated (L=0 or footnotes) these extended forms support 128/256-bit operation under AVX and 128/256/512-bit operation under AVX-512.

[Table omitted - columns include: Description, Instruction mnemonics, ...]
  1. For the VPSHUFLW, VPSHUFHW, VHADDP*, VHSUBP*, VDPPS and VDPPD instructions, encodings with a vector-length wider than 128 bits are available under AVX2 and/or AVX-512, but the operation of such encodings is split into 128-bit lanes where each 128-bit lane internally performs the same operation as the 128-bit variant of the instruction.
  2. Under AVX, the VPSHUFHW and VPSHUFLW instructions are only available in 128-bit forms - the 256-bit forms of these instructions require AVX2.
  3. For the EVEX-encoded form of VCVTDQ2PD, EVEX embedded rounding controls are permitted but have no effect.
  4. Opcode not EVEX-encodable. Performing a vector logical test under AVX-512 requires a sequence of at least 2 instructions, e.g. VPTESTMD followed by KORTESTW.
  5. Assemblers may accept the BLENDVPS/BLENDVPD instructions with or without XMM0 as a third argument.
  6. While AVX does provide VBLENDVPS/VBLENDVPD instructions that are similar in function to BLENDVPS/BLENDVPD, they use a different opcode and operand encoding - VEX.66.0F3A.W0 4A/4B /r /is4.
  7. Opcode not available under AVX-512. Instead, AVX512F provides different opcodes - EVEX.66.0F3A (08..0B) /r ib - for its new VRNDSCALE* rounding instructions.
  8. Under AVX-512, EVEX-encoding the INSERTQ/EXTRQ opcodes results in AVX-512 instructions completely unrelated to SSE4a, namely VCVT(T)P(S|D)2UQQ and VCVT(T)S(S|D)2USI.



AVX/AVX2 instructions, and AVX-512 extended variants thereof


This covers instructions/opcodes that are new to AVX and AVX2.

AVX and AVX2 also include extended VEX-encoded forms of a large number of MMX/SSE instructions - please see tables above.

Some of the AVX/AVX2 instructions also exist in extended EVEX-encoded forms under AVX-512.

AVX instructions

[Table omitted - columns include: Instruction description, Instruction mnemonics, ...]
  1. For code that may potentially mix use of legacy-SSE instructions with 256-bit AVX instructions, it is strongly recommended to execute a VZEROUPPER or VZEROALL instruction after executing AVX instructions but before executing SSE instructions. If this is not done, any subsequent legacy-SSE code may be subject to severe performance degradation.[18]
  2. While the VZEROUPPER and VZEROALL instructions are architecturally listed as ignoring the VEX.W bit, some early AVX implementations (e.g. Sandy Bridge[19]) will #UD if the VZEROUPPER and VZEROALL instructions are encoded with VEX.W=1. For this reason, it is recommended to encode these instructions with VEX.W=0.
  3. VBROADCASTSS and VBROADCASTSD with a register source operand are not supported under AVX - support for xmm-register source operands for these instructions was added in AVX2.
  4. The V(P)BROADCAST* instructions perform broadcast as part of their normal operation - under AVX-512 with EVEX prefix, they do not require or accept the EVEX.b modifier.
  5. The VBROADCASTSD instruction does not support broadcast of 64-bit data into a 128-bit vector. For broadcast of 64-bit data into a 128-bit vector, the SSE3 (V)MOVDDUP instruction or the AVX2 VPBROADCASTQ instruction may be used.
  6. Under AVX-512, EVEX-encoded forms of the VMASKMOVP(S|D) instructions are not available. For masked moves of FP32/FP64 values to/from memory under AVX-512, the VMOVUPS and VMOVUPD instructions may be used with an opmask register.
  7. Under AVX, the VPBLENDVB instruction is only available with a 128-bit vector width (VEX.L=0). Support for 256-bit vector width was added in AVX2.

AVX2 instructions

[Table omitted - columns include: Instruction description, Instruction mnemonics, ...]
  1. For AVX-512, variants of the VPBROADCAST(B/W/D/Q) instructions that can use a general-purpose register as source exist as well, with opcodes EVEX.66.0F38.W0 (7A..7C)
  2. The V(P)BROADCAST* instructions perform broadcast as part of their normal operation - under AVX-512 with EVEX prefix, they do not require or accept the EVEX.b modifier.
  3. For VPERMPS, VPERMPD, VPERMD and VPERMQ, minimum supported vector width is 256 bits. For shuffles in a 128-bit vector, use VPERMILPS or VPERMILPD.
  4. Under AVX-512, executing the VPERMPD and VPERMQ instructions with a vector width of 512 bits will cause the operation to be split into two 256-bit halves, with the imm8 swizzle being applied to each half separately.
    Under AVX-512, variable-shuffle variants of the VPERMPD and VPERMQ instructions exist with opcodes EVEX.66.0F38.W1 16 /r and EVEX.66.0F38.W1 36 /r, respectively - these variants do not split their operation into 256-bit halves.
  5. For EVEX-encoded forms of the V(P)GATHER* instructions under AVX-512, lane-masking is done with an opmask register instead of an XMM/YMM/ZMM vector register.

Other VEX-encoded SIMD instructions

SIMD instruction set extensions that use the VEX prefix and are not considered part of baseline AVX/AVX2/AVX-512, FMA3/4 or AMX.

Integer, opmask and cryptographic instructions that use the VEX prefix (e.g. the BMI2, CMPccXADD, VAES and SHA512 extensions) are not included.

[Table omitted - columns include: Instruction set extension, Instruction description, ...]
  1. For the VCVTPS2PH instruction, if bit 2 of the imm8 argument is set, then the rounding mode to use is taken from the MXCSR, else the rounding mode is taken from bits 1:0 of the imm8 (the top 5 bits of the imm8 are ignored). The supported rounding modes are:
    [Table omitted - columns include: Value, Rounding mode, ...]
  2. VCVTNEPS2BF16 is the only AVX512_BF16 instruction for which the AVX-NE-CONVERT extension provides a VEX-encoded form. The other AVX512_BF16 instructions (none of which have any VEX-encoded forms) are not listed here.



FMA3 and FMA4 instructions


Floating-point fused multiply-add instructions are introduced in x86 as two instruction set extensions, "FMA3" and "FMA4", both of which build on top of AVX to provide a set of scalar/vector instructions using the xmm/ymm/zmm vector registers. FMA3 defines a set of 3-operand fused-multiply-add instructions that take three input operands and write the result back to the first of them. FMA4 defines a set of 4-operand fused-multiply-add instructions that take four operands – a destination operand and three source operands.

FMA3 is supported on Intel CPUs starting with Haswell, on AMD CPUs starting with Piledriver, and on Zhaoxin CPUs starting with YongFeng. FMA4 was only supported on AMD Family 15h (Bulldozer) CPUs and was dropped from AMD Zen onwards. The FMA3/FMA4 extensions are not considered to be an intrinsic part of AVX or AVX2, although all Intel and AMD (but not Zhaoxin) processors that support AVX2 also support FMA3. FMA3 instructions (in EVEX-encoded form) are, however, AVX-512 foundation instructions.
The FMA3 and FMA4 instruction sets both define a set of 10 fused-multiply-add operations, all available in FP32 and FP64 variants. For each of these variants, FMA3 defines three operand orderings while FMA4 defines two.
FMA3 encoding
FMA3 instructions are encoded with the VEX or EVEX prefixes – in the form VEX.66.0F38 xy /r or EVEX.66.0F38 xy /r. The VEX.W/EVEX.W bit selects the floating-point format (W=0 means FP32, W=1 means FP64). The opcode byte xy consists of two nibbles, where the top nibble x selects the operand ordering (9='132', A='213', B='231') and the bottom nibble y (values 6..F) selects which one of the 10 fused-multiply-add operations to perform. (Values of x and y outside the given ranges will result in something that is not an FMA3 instruction.)
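For example, VFMADD231PS is encoded as VEX.66.0F38.W0 B8 /r: the top nibble B selects the '231' operand ordering, the bottom nibble 8 selects the packed fused-multiply-add operation, and W=0 selects FP32 (the same opcode with W=1 yields VFMADD231PD).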
At the assembly language level, the operand ordering is specified in the mnemonic of the instruction:

  • vfmadd132sd xmm1,xmm2,xmm3 will perform xmm1 ← (xmm1*xmm3)+xmm2
  • vfmadd213sd xmm1,xmm2,xmm3 will perform xmm1 ← (xmm2*xmm1)+xmm3
  • vfmadd231sd xmm1,xmm2,xmm3 will perform xmm1 ← (xmm2*xmm3)+xmm1

For all FMA3 variants, the first two arguments must be xmm/ymm/zmm vector register arguments, while the last argument may be either a vector register or memory argument. Under AVX-512 and AVX10, the EVEX-encoded variants support EVEX-prefix-encoded broadcast, opmasks and rounding-controls.
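For example (Intel-style assembly syntax; broadcast and rounding-control syntax varies between assemblers):

  • vfmadd231pd zmm0{k1},zmm1,qword ptr [rax]{1to8} will broadcast one FP64 value from memory to all 8 lanes of the multiplication, with the result write masked by k1
  • vfmadd213ss xmm1,xmm2,xmm3,{rz-sae} will perform a scalar FMA with an embedded round-toward-zero override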
The AVX512-FP16 extension, introduced in Sapphire Rapids, adds FP16 variants of the FMA3 instructions – these all take the form EVEX.66.MAP6.W0 xy /r with the opcode byte working in the same way as for the FP32/FP64 variants. The AVX10.2 extension, published in 2024,[20] similarly adds BF16 variants of the packed (but not scalar) FMA3 instructions – these all take the form EVEX.NP.MAP6.W0 xy /r with the opcode byte again working similar to the FP32/FP64 variants. (For the FMA4 instructions, no FP16 or BF16 variants are defined.)
FMA4 encoding
FMA4 instructions are encoded with the VEX prefix, in the form VEX.66.0F3A xx /r ib (no EVEX encodings are defined). The opcode byte xx uses its bottom bit to select the floating-point format (0=FP32, 1=FP64) and the remaining bits to select one of the 10 fused-multiply-add operations to perform.

For FMA4, operand ordering is controlled by the VEX.W bit. If VEX.W=0, then the third operand is the r/m operand specified by the instruction's ModR/M byte and the fourth operand is a register operand, specified by bits 7:4 of the ib (8-bit immediate) part of the instruction. If VEX.W=1, then these two operands are swapped. For example:

  • vfmaddsd xmm1,xmm2,[mem],xmm3 will perform xmm1 ← (xmm2*[mem])+xmm3 and require a W=0 encoding.
  • vfmaddsd xmm1,xmm2,xmm3,[mem] will perform xmm1 ← (xmm2*xmm3)+[mem] and require a W=1 encoding.
  • vfmaddsd xmm1,xmm2,xmm3,xmm4 will perform xmm1 ← (xmm2*xmm3)+xmm4 and can be encoded with either W=0 or W=1.


Opcode table
The 10 fused-multiply-add operations and the 122 instruction variants they give rise to are given by the following table – with FMA4 instructions highlighted with * and yellow cell coloring, and FMA3 instructions not highlighted:

[Table omitted - columns include: Basic operation, Opcode byte, ...]
  1. Vector register lanes are counted from 0 upwards in a little-endian manner – the lane that contains the first byte of the vector is considered to be even-numbered.

AVX-512


AVX-512, introduced in 2014, adds 512-bit wide vector registers (extending the 256-bit registers, which become the new registers' lower halves) and doubles their count to 32; the new registers are thus named zmm0 through zmm31. It adds eight mask registers, named k0 through k7, which may be used to restrict operations to specific parts of a vector register. Unlike previous instruction set extensions, AVX-512 is implemented in several groups; only the foundation ("AVX-512F") extension is mandatory.[21] Most of the added instructions may also be used with the 256- and 128-bit registers.
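For example, a compare followed by a masked load (Intel-style assembly syntax):

  • vcmpps k1,zmm1,zmm2,1 will compare 16 FP32 lanes for less-than and write the 16 result bits to opmask register k1
  • vmovups zmm3{k1}{z},[rax] will load only the lanes selected by k1, zeroing the remaining lanes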

AVX-512 foundation, byte/word and doubleword/quadword instructions (F, BW and DQ subsets)

This covers instructions that are new to AVX-512's F, BW and DQ subsets.

These AVX-512 subsets also include extended EVEX-encoded forms of a large number of MMX/SSE/AVX instructions - please see tables above.

Regularly-encoded floating-point instructions

These instructions all follow a common encoding pattern:

  • EVEX.W is used to specify the floating-point format (0 = FP32, 1 = FP64)
  • The bottom opcode bit is used to select between packed and scalar operation (0 = packed, 1 = scalar)
  • For a given operation, all the scalar/packed variants belong to the same AVX-512 subset.
  • The instructions all support result masking by opmask registers. They also all support broadcast of memory operands for packed variants.
  • If AVX512VL is supported, then all vector widths (128-bit, 256-bit and 512-bit) are supported for packed variants.
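As a worked example of this pattern, the VRCP14* reciprocal-approximation group (AVX512F) occupies the opcodes EVEX.66.0F38 4C/4D /r: opcode 4C with W=0 is VRCP14PS and with W=1 is VRCP14PD (packed forms), while opcode 4D with W=0 is VRCP14SS and with W=1 is VRCP14SD (scalar forms).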
[Table omitted.]
  1. For the VGETEXP* instructions:
    • If the source argument is ±0, then the instruction returns −∞.
    • If the source argument is ±∞, then the instruction returns +∞.
  2. The operation of the VRCP14* and VRSQRT14* approximation instructions is defined in a bit-exact manner − a C reference model is provided by Intel.[12]
  3. The normalization intervals supported for the VGETMANT* instructions are:
    [Table omitted.]
    For the [1/2,2) normalization interval, the input value will be scaled by a power of 4. If the input value is 0.0 or ±∞, then the result of VGETMANT* is 1.0, possibly modified by sign control.
  4. For the VGETMANT* instructions, if bit 3 of its imm8 argument is set and the input value is negative (−0.0 is not considered negative), then the result of VGETMANT* is converted to qNaN. Else if bit 2 is set, the sign bit of the result is cleared, else the input sign bit is kept.
  5. The floating-point classes that the VFIXUPIMM* instructions will recognize are:
    [Table omitted - columns include: Number type, Classification index, ...]
    If MXCSR.DAZ is set, then denormals are classified as zero, else they are classified as positive/negative finite values.
  6. The response values that can be selected by the VFIXUPIMM* instructions' 4-bit table lookup result are:
    [Table omitted.]
    If MXCSR.DAZ is set, then any denormal source values (but not destination values) are flushed to zero before the response-value is selected. (This affects responses 1 and 2.)
  7. The floating-point exceptions that can be indicated by the imm8 argument to the VFIXUPIMM* instruction are:
    [Table omitted - columns include: #ZE if, ...]
    Denormal source values are treated as 0.0 if MXCSR.DAZ is set and as positive/negative finite values otherwise.
  8. For the VRANGE* instructions, bits 1:0 of the instruction's imm8 argument will pick a comparison and selection to perform as follows:
    [Table omitted - columns include: Value, Meaning, ...]
    If both source arguments are ±0, then for the purposes of this comparison and selection, −0 is considered less than +0. If either input is NaN, then the result is NaN.
  9. For the VRANGE* instructions, bits 3:2 of the imm8 are used to select the sign-bit of the result as follows:
    [Table omitted - columns include: Value, Meaning, ...]
  10. If the source argument to the VREDUCE* instructions is ±∞, then the result will be ±0.0.
  11. The imm8 argument to the VFPCLASS* instructions has the following layout:
    [Table omitted.]
    If MXCSR.DAZ is set, then denormals are treated as ±0 for the classification test.

Opmask instructions

AVX-512 introduces, in addition to 512-bit vectors, a set of eight opmask registers, named k0 through k7. These registers are 64 bits wide in implementations that support AVX512BW and 16 bits wide otherwise. They are mainly used to enable/disable operation on a per-lane basis for most of the AVX-512 vector instructions. They are usually set with vector-compare instructions or instructions that otherwise produce a 1-bit per-lane result as a natural part of their operation - however, AVX-512 also defines a set of 55 new instructions to assist manual manipulation of the opmask registers.

These instructions are, for the most part, defined in groups of 4 instructions, where the four instructions in a group are basically just 8-bit, 16-bit, 32-bit and 64-bit variants of the same basic operation (where only the low 8/16/32/64 bits of the registers participate in the given operation and, if a result is written back to a register, all bits except the bottom 8/16/32/64 bits are set to zero). The opmask instructions are all encoded with the VEX prefix (unlike all other AVX-512 instructions, which are encoded with the EVEX prefix).

In general, the 16-bit variants of the instructions are introduced by AVX512F (except KADDW and KTESTW), the 8-bit variants by the AVX512DQ extension, and the 32/64-bit variants by the AVX512BW extension.

Most of the instructions follow a very regular encoding pattern where the four instructions in a group have identical encodings except for the VEX.pp and VEX.W fields:
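For example, the bitwise-AND group occupies opcode 41 /r, with the four variants distinguished only by the VEX.pp and VEX.W fields:

  • KANDW k1,k2,k3 - VEX.L1.NP.0F.W0 41 /r (AVX512F)
  • KANDB k1,k2,k3 - VEX.L1.66.0F.W0 41 /r (AVX512DQ)
  • KANDQ k1,k2,k3 - VEX.L1.NP.0F.W1 41 /r (AVX512BW)
  • KANDD k1,k2,k3 - VEX.L1.66.0F.W1 41 /r (AVX512BW)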

[Table omitted - columns include: Instruction description, Basic opcode, ...]
  1. The 16-bit opmask instructions KADDW and KTESTW were introduced with AVX512DQ, not AVX512F.
  2. On processors that support Intel APX, all forms of the KMOV* instructions (but not any other opmask instructions) can be EVEX-encoded.
  3. The 32/64-bit KMOVD/KMOVQ instructions to move between opmask-registers and general-purpose registers do exist, but do not match the pattern of the opcodes in this table. See table below.

Not all of the opmask instructions fit the pattern above - the remaining ones are:

[Table omitted - columns include: Instruction description, Instruction mnemonics, ...]
  1. For the KSHIFT* instructions, the imm8 shift-amount is not masked. Specifying a shift-amount greater than or equal to the operand size will produce an all-zeroes result.
  2. On processors that support Intel APX, all forms of the KMOV* instructions (but not any other opmask instructions) can be EVEX-encoded.
  3. The KMOVQ instruction form with a 64-bit general-purpose register operand is only available in x86-64 long mode. The instruction will execute as KMOVD in 32-bit mode.

Compare, test, blend, opmask-convert instructions

Vector-register instructions that use opmasks in ways other than just as a result writeback mask.

[Table omitted - columns include: Description, Instruction mnemonics, ...]
  1. For the AVX-512 V(P)BLENDM* instructions, result write masking is not available - the EVEX-prefix opmask register argument that is normally used for write-masking with most other AVX-512 instructions is instead used for source selection.
  2. The VPMOVM2* and VPMOV*2M instructions do not support result masking by an EVEX-encoded opmask register, requiring EVEX.aaa=0. The opmask register operands of these instructions are instead specified through the ModR/M byte.

Data conversion instructions

[Table omitted - columns include: Instruction description, Instruction mnemonics, ...]
  1. For instructions that perform conversions from unsigned 32-bit integer to FP64 (VCVTUDQ2PD and the W=0 variant of VCVTUSI2SD), EVEX embedded rounding controls are permitted but have no effect.
  2. Scalar conversions to/from 64-bit integer (VCVTSS2USI, VCVTSD2USI, VCVTTSS2USI, VCVTTSD2USI, VCVTUSI2SS, VCVTUSI2SD with EVEX.W=1 encoding) are only available in 64-bit "long mode". Otherwise, these instructions execute as if EVEX.W=0, resulting in 32-bit integer operation.

Data movement instructions

[Table omitted - columns include: Instruction description, Instruction mnemonics, ...]
  1. Variants of the VPBROADCAST* instructions that can take their source from memory or an XMM register were introduced in AVX2.
  2. VPBROADCASTQ with 64-bit register source operand is available only in 64-bit long mode. In 32-bit mode, the instruction will execute as if EVEX.W=0, resulting in 32-bit operation.
  3. The V(P)BROADCAST* instructions perform broadcast as part of their normal operation - under AVX-512 with EVEX prefix, they do not need the EVEX.b modifier.

Shift/rotate instructions

[Table omitted - columns include: Instruction description, Instruction mnemonics, ...]

Other AVX-512 foundation instructions

[Table omitted - first entry: "Perform floating-point rounding to a multiple of ...".]

AMX


Intel AMX adds eight new tile-registers, tmm0-tmm7, each holding a matrix, with a maximum capacity of 16 rows of 64 bytes per tile-register. It also adds a TILECFG register to configure the sizes of the actual matrices held in each of the eight tile-registers, and a set of instructions to perform matrix multiplications on these registers.
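A minimal usage sketch (Intel-style assembly; assumes the AMX-INT8 subset's TDPBSSD multiply instruction and a 64-byte tile-configuration block already prepared at [rdi]):

  • ldtilecfg [rdi] - load TILECFG from the 64-byte configuration block
  • tileloadd tmm0,[rsi+r8] - load tile A, with r8 providing the per-row stride
  • tileloadd tmm1,[rdx+r9] - load tile B
  • tilezero tmm2 - clear the accumulator tile
  • tdpbssd tmm2,tmm0,tmm1 - accumulate signed-byte dot-products of tmm0 and tmm1 into the 32-bit lanes of tmm2
  • tilestored [rcx+r10],tmm2 - store the result tile
  • tilerelease - return the tile state to the init state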

[Table omitted - columns include: AMX subset, Instruction mnemonics, ...]
  1. For TILEZERO, the tile-register to clear is specified by bits 5:3 of the instruction's ModR/M byte. Bits 7:6 must be set to 11b, and bits 2:0 must be set to 000b.
  2. For the TILELOADD, TILELOADDT1 and TILESTORED instructions, the memory argument must use a memory addressing mode with the SIB-byte. Under this addressing mode, the base register and displacement are used to specify the starting address for the first row of the tile to load/store from/to memory – the scale and index are used to specify a per-row stride.
    These instructions are all interruptible – an interrupt or memory exception taken in the middle of these instructions will cause progress tracking information to be written to TILECFG.start_row, so that the instruction may continue on a partially-loaded/stored tile after the interruption.
  3. For all of the AMX matrix multiply instructions, the three arguments are required to be three different tile registers, or else the instruction will #UD.
