X86 SIMD instruction listings

The x86 instruction set has several times been extended with SIMD (Single instruction, multiple data) instruction set extensions. These extensions, starting from the MMX instruction set extension introduced with Pentium MMX in 1997, typically define sets of wide registers and instructions that subdivide these registers into fixed-size lanes and perform a computation for each lane in parallel.

Remove ads

Summary of SIMD extensions

Summarize

Perspective

The main SIMD instruction set extensions that have been introduced for x86 are:

More information SIMD instruction set extension, Year ...

SIMD instruction set extension	Year	Description	Added in
MMX	1997	A set of 57 integer SIMD instruction acting on 64-bit vectors, mostly providing 8/16/32-bit lane-width operations. Repurposed the old x87 FPU register-file as a bank of eight 64-bit vector registers, referred to as MM0..MM7 when used for MMX instructions.	Intel Pentium MMX, AMD K6, Intel Pentium II, Cyrix 6x86MX, MediaGXm, Rise mP6, IDT WinChip C6, Transmeta Crusoe, DM&P Vortex86MX
SSE Streaming SIMD Extensions	1999	"Katmai New Instructions" - introduced a set of 70 new instructions. Most but not all of these instructions provide scalar and vector operations on 32-bit floating-point values in 128-bit SIMD vector registers. (Some of the SSE instructions were instead new MMX instructions and non-SIMD instructions such as `SFENCE` - the subset of SSE that excludes the 128-bit SIMD register instructions is known as "MMX+", and is supported on some AMD processors that didn't implement full SSE, notably early Athlons and Geode LX.) SSE introduced a new set of eight vector registers XMM0..XMM7, each 128 bits, and a status/control register MXCSR. This set of eight vector registers would later be extended to 16 registers with the introduction of x86-64.	Intel Pentium III, AMD Athlon XP, VIA C3 "Nehemiah", Transmeta Efficeon
SSE2 Streaming SIMD Extensions 2	2000	Extended SSE with 144 new instructions - mainly additional instructions to work on scalars and vectors of 64-bit floating-point values, as well as 128-bit-vector forms of most of the MMX integer instructions.	Intel Pentium 4, Intel Pentium M, AMD Athlon 64, Transmeta Efficeon, VIA C7
SSE3 Streaming SIMD Extensions 3	2004	"Prescott New Instructions": added a set of 13 new instructions,^[a] mostly horizontal add/subtract operations.	Intel Pentium 4 "Prescott", Transmeta Efficeon 8800, AMD Athlon 64 "Venice", VIA C7, Intel Core "Yonah"
SSSE3 Supplemental SSE3	2006	Added a set of 32 new instructions to extend MMX and SSE, including a byte-shuffle instruction.	Intel Core 2 "Conroe"/"Merom", VIA Nano 2000, Intel Atom "Bonnell", AMD "Bobcat", AMD FX "Bulldozer"
SSE4a	2007	AMD-only extension that added a set of 4 instructions, including bitfield insert/extract and scalar non-temporal store instructions.	AMD K10
SSE4.1	2007	Added a set of 47 instructions, including variants of integer min/max, widening integer conversions, vector lane insert/extract, and dot-product instructions.	Intel Core 2 "Penryn", VIA Nano 3000, AMD FX "Bulldozer", AMD "Jaguar", Intel Atom "Silvermont", Zhaoxin ZX-A
SSE4.2	2008	Added a set of 7 instructions, mostly pertaining to string processing.	Intel Core i7 "Nehalem", AMD FX "Bulldozer", AMD "Jaguar", Intel Atom "Silvermont", VIA Nano QuadCore C4000, Zhaoxin ZX-C "ZhangJiang"
AVX Advanced Vector Extensions	2011	Extended the XMM0..XMM15 vector registers to 256-bit registers, referred to as YMM0..YMM15 when used as full 256-bit registers. Added three-operand variants of most of the SSE1-4 vector instructions, as well as 256-bit vector variants of most of the SSE1-4 vector instructions acting on 32/64-bit floating-point values. These new instruction variants are all encoded with the new VEX prefix.	Intel Core i7 "Sandy Bridge", AMD FX "Bulldozer", AMD "Jaguar", VIA Nano QuadCore C4000, Zhaoxin ZX-C "ZhangJiang", Intel Atom "Gracemont"
FMA3 Fused Multiply-Add	2013	Added three-operand floating-point fused-multiply add operations, scalar and vector variants.	Intel Core i7 "Haswell", AMD FX "Piledriver", Intel Atom "Gracemont", Zhaoxin KH-40000 "YongFeng"
AVX2 Advanced Vector Extensions 2	2013	Added 256-bit vector variants of most of the MMX/SSE1-4 vector integer instructions. Also adds vector gather instructions.	Intel Core i7 "Haswell", AMD FX "Excavator", VIA Nano QuadCore C4000, Intel Atom "Gracemont", Zhaoxin KH-40000 "YongFeng"^[b]
AVX-512	2016	Extended the YMM0..YMM15 vector registers to a set of 32 registers, each 512-bits wide - referred to as ZMM0..ZMM31 when used as 512-bit registers. Also added eight opmask registers K0..K7. Added 512-bit versions of most of the MMX/SSE/AVX vector instructions, as well as a substantial number of additional instructions. These are mostly encoded with the new EVEX prefix (except for opmask management instructions, which continue to use the VEX prefix.) Added the ability to perform per-vector-lane masking of the operation of most of its vector instructions, by using the opmask registers. Also added embedded rounding controls for floating-point instructions and a scalar-to-vector broadcast function for most instructions that can accept memory operands.	AVX512 Foundation: Intel Xeon Phi x200, Intel Core i9 "Skylake-X", AMD Zen 4 (See AVX-512#New instructions by sets for additional subsets.)
AMX Advanced Matrix Extensions	2023	Added a set of eight new tile registers, referred to as TMM0..TMM7. Each of these tile registers has a size of 8192 bits (16 rows of 64 bytes each). Also added a 64-byte tile configuration register TILECFG, and instructions to perform matrix multiplication on the tile registers with various data formats.	Intel Xeon "Sapphire Rapids"
AVX10.1	2024	Reformulation of AVX-512 that includes most of the optional AVX-512 subsets (F,CD,BW,DQ,VL,IFMA,VBMI,VNNI,BF16,VBMI2,BITALG,VPOPCNTDQ,FP16) as baseline functionality, and switches feature enumeration from the flag-based scheme of AVX-512 to a version-based scheme.^[c] No new instructions are added.	Intel Xeon 6 "Granite Rapids"
AVX10.2	(2025)	Adds instructions to convert to/from MXFP8 datatypes, perform arithmetic on BF16 numbers, saturating conversions from floating-point to integer, IEEE754-compliant min/max, and a few other instructions.	(Intel Diamond Rapids)

[a]
The count of 13 instructions for SSE3 includes the non-SIMD instructions MONITOR and MWAIT that were also introduced as part of "Prescott New Instructions" - these two instructions are considered to be SSE3 instructions by Intel but not by AMD.
[b]
On older Zhaoxin processors, such as KX-6000 "LuJiaZui", AVX2 instructions are present but not exposed through CPUID due to the lack of FMA3 support.^[1]
[c]
Early drafts of the AVX10 specification also added an option for implementations to limit the maximum supported vector-register width to 128/256 bits^[2] - however, as of March 2025, this option has been removed, making support for 512-bit vector-register width mandatory again.^[3]^[4]

Remove ads

MMX instructions and extended variants thereof

Summarize

Perspective

These instructions are, unless otherwise noted, available in the following forms:

MMX: 64-bit vectors, operating on mm0..mm7 registers (aliased on top of the old x87 register file)
SSE2: 128-bit vectors, operating on xmm0..xmm15 registers (xmm0..xmm7 in 32-bit mode)
AVX: 128-bit vectors, operating on xmm0..xmm15 registers, with a new three-operand encoding enabled by the new VEX prefix. (AVX introduced 256-bit vector registers, but the full width of these vectors was in general not made available for integer SIMD instructions until AVX2.)
AVX2: 256-bit vectors, operating on ymm0..ymm15 registers (extended versions of the xmm0..xmm15 registers)
AVX-512: 512-bit vectors, operating on zmm0..zmm31 registers (zmm0..zmm15 are extended versions of the ymm0..ymm15 registers, while zmm16..zmm31 are new to AVX-512). AVX-512 also introduces opmasks, allowing the operation of most instructions to be masked on a per-lane basis by an opmask register (the lane width varies from one instruction to another). AVX-512 also adds broadcast functionality for many of its instructions - this is used with memory source arguments to replicate a single value to all lanes of a vector calculation. The tables below provide indications of whether opmasks and broadcasts are supported for each instruction, and if so, what lane-widths they are using.

For many of the instruction mnemonics, (V) is used to indicate that the instruction mnemonic exists in forms with and without a leading V - the form with the leading V is used for the VEX/EVEX-prefixed instruction variants introduced by AVX/AVX2/AVX-512, while the form without the leading V is used for legacy MMX/SSE encodings without VEX/EVEX-prefix.

Original Pentium MMX instructions, and SSE2/AVX/AVX-512 extended variants thereof

More information Description, Instruction mnemonics ...

Description		Instruction mnemonics	Basic opcode	MMX (no prefix)	SSE2 (66h prefix)	AVX (VEX.66 prefix)	AVX-512 (EVEX.66 prefix)
Description		Instruction mnemonics	Basic opcode	MMX (no prefix)	SSE2 (66h prefix)	AVX (VEX.66 prefix)	supported	subset	lane	bcst
Empty MMX technology state. Mark all the FP/MMX registers as Empty, so that they can be freely used by later x87 code.^[a]		`EMMS` (MMX)	`0F 77`	Yes	No	No^[b]	No	—	—	—

Move scalar value from GPR (general-purpose register) or memory to vector register, with zero-fill	32-bit	`(V)MOVD mm, r/m32`	`0F 6E /r`	Yes	Yes	Yes (L=0,W=0)	Yes (L=0,W=0)	F	No	No
	64-bit (x86-64)	`(V)MOVQ mm, r/m64`, `MOVD mm, r/m64`^[c]	`0F 6E /r`	Yes (REX.W)	Yes (REX.W)^[d]	Yes (L=0,W=1)	Yes (L=0,W=1)	F	No	No
Move scalar value from vector register to GPR or memory	32-bit	`(V)MOVD r/m32, mm`	`0F 7E /r`	Yes	Yes	Yes (L=0,W=0)	Yes (L=0,W=0)	F	No	No
Move scalar value from vector register to GPR or memory	64-bit (x86-64)	`(V)MOVQ r/m64, mm`, `MOVD r/m64, mm`^[c]	`0F 7E /r`	Yes (REX.W)	Yes (REX.W)^[d]	Yes (L=0,W=1)	Yes (L=0,W=1)	F	No	No
Vector move between vector register and either memory or another vector register. For move to/from memory, the memory address is required to be aligned for `(V)MOVDQA` variants but not for `MOVQ`. The 128-bit VEX-encoded forms of `VMOVDQA` with a memory argument will, if the memory is cacheable, perform their memory accesses atomically.^[e]		`MOVQ mm/m64, mm`(MMX) `(V)MOVDQA xmm/m128,xmm`	`0F 7F /r`	`MOVQ`	`MOVDQA`	`VMOVDQA`^[f]	`VMOVDQA32`(W0)	F	32	No
		`MOVQ mm/m64, mm`(MMX) `(V)MOVDQA xmm/m128,xmm`	`0F 7F /r`	`MOVQ`	`MOVDQA`	`VMOVDQA`^[f]	`VMOVDQA64`(W1)	F	64	No
		`MOVQ mm, mm/m64`(MMX) `(V)MOVDQA xmm,xmm/m128`	`0F 6F /r`	`MOVQ`	`MOVDQA`	`VMOVDQA`^[f]	`VMOVDQA32`(W0)	F	32	No
		`MOVQ mm, mm/m64`(MMX) `(V)MOVDQA xmm,xmm/m128`	`0F 6F /r`	`MOVQ`	`MOVDQA`	`VMOVDQA`^[f]	`VMOVDQA64`(W1)	F	64	No

Pack 32-bit signed integers to 16-bit, with saturation		`(V)PACKSSDW mm, mm/m64`^[g]	`0F 6B /r`	Yes	Yes	Yes	Yes (W=0)	BW	16	32
Pack 16-bit signed integers to 8-bit, with saturation		`(V)PACKSSWB mm, mm/m64`^[g]	`0F 63 /r`	Yes	Yes	Yes	Yes	BW	8	No
Pack 16-bit unsigned integers to 8-bit, with saturation		`(V)PACKUSWB mm, mm/m64`^[g]	`0F 67 /r`	Yes	Yes	Yes	Yes	BW	8	No
Unpack and interleave packed integers from the high halves of two input vectors	8-bit	`(V)PUNPCKHBW mm, mm/m64`^[g]	`0F 68 /r`	Yes	Yes	Yes	Yes	BW	8	No
	16-bit	`(V)PUNPCKHWD mm, mm/m64`^[g]	`0F 69 /r`	Yes	Yes	Yes	Yes	BW	16	No
	32-bit	`(V)PUNPCKHDQ mm, mm/m64`^[g]	`0F 6A /r`	Yes	Yes	Yes	Yes (W=0)	F	32	32
Unpack and interleave packed integers from the low halves of two input vectors	8-bit	`(V)PUNPCKLBW mm, mm/m32`^[g]^[h]	`0F 60 /r`	Yes	Yes	Yes	Yes	BW	8	No
	16-bit	`(V)PUNPCKLWD mm, mm/m32`^[g]^[h]	`0F 61 /r`	Yes	Yes	Yes	Yes	BW	16	No
	32-bit	`(V)PUNPCKLDQ mm, mm/m32`^[g]^[h]	`0F 62 /r`	Yes	Yes	Yes	Yes (W=0)	F	32	32

Add packed integers	8-bit	`(V)PADDB mm, mm/m64`	`0F FC /r`	Yes	Yes	Yes	Yes	BW	8	No
	16-bit	`(V)PADDW mm, mm/m64`	`0F FD /r`	Yes	Yes	Yes	Yes	BW	16	No
	32-bit	`(V)PADDD mm, mm/m64`	`0F FE /r`	Yes	Yes	Yes	Yes (W=0)	F	32	32
Add packed signed integers with saturation	8-bit	`(V)PADDSB mm, mm/m64`	`0F EC /r`	Yes	Yes	Yes	Yes	BW	8	No
Add packed signed integers with saturation	16-bit	`(V)PADDSW mm, mm/m64`	`0F ED /r`	Yes	Yes	Yes	Yes	BW	16	No
Add packed unsigned integers with saturation	8-bit	`(V)PADDUSB mm, mm/m64`	`0F DC /r`	Yes	Yes	Yes	Yes	BW	8	No
Add packed unsigned integers with saturation	16-bit	`(V)PADDUSW mm, mm/m64`	`0F DD /r`	Yes	Yes	Yes	Yes	BW	16	No

Subtract packed integers	8-bit	`(V)PSUBB mm, mm/m64`	`0F F8 /r`	Yes	Yes	Yes	Yes	BW	8	No
	16-bit	`(V)PSUBW mm, mm/m64`	`0F F9 /r`	Yes	Yes	Yes	Yes	BW	16	No
	32-bit	`(V)PSUBD mm, mm/m64`	`0F FA /r`	Yes	Yes	Yes	Yes (W=0)	F	32	32
Subtract packed signed integers with saturation	8-bit	`(V)PSUBSB mm, mm/m64`	`0F E8 /r`	Yes	Yes	Yes	Yes	BW	8	No
Subtract packed signed integers with saturation	16-bit	`(V)PSUBSW mm, mm/m64`	`0F E9 /r`	Yes	Yes	Yes	Yes	BW	16	No
Subtract packed unsigned integers with saturation	8-bit	`(V)PSUBUSB mm, mm/m64`	`0F D8 /r`	Yes	Yes	Yes	Yes	BW	8	No
Subtract packed unsigned integers with saturation	16-bit	`(V)PSUBUSW mm, mm/m64`	`0F D9 /r`	Yes	Yes	Yes	Yes	BW	16	No

Compare packed integers for equality	8-bit	`(V)PCMPEQB mm, mm/m64`	`0F 74 /r`	Yes	Yes	Yes	Yes^[i]	BW	8	No
	16-bit	`(V)PCMPEQW mm, mm/m64`	`0F 75 /r`	Yes	Yes	Yes	Yes^[i]	BW	16	No
	32-bit	`(V)PCMPEQD mm, mm/m64`	`0F 76 /r`	Yes	Yes	Yes	Yes (W=0)^[i]	F	32	32
Compare packed integers for signed greater-than	8-bit	`(V)PCMPGTB mm, mm/m64`	`0F 64 /r`	Yes	Yes	Yes	Yes^[i]	BW	8	No
	16-bit	`(V)PCMPGTW mm, mm/m64`	`0F 65 /r`	Yes	Yes	Yes	Yes^[i]	BW	16	No
	32-bit	`(V)PCMPGTD mm, mm/m64`	`0F 66 /r`	Yes	Yes	Yes	Yes (W=0)^[i]	F	32	32

Multiply packed 16-bit signed integers, add results pairwise into 32-bit integers		`(V)PMADDWD mm, mm/m64`	`0F F5 /r`	Yes	Yes	Yes	Yes^[j]	BW	32	No
Multiply packed 16-bit signed integers, store high 16 bits of results		`(V)PMULHW mm, mm/m64`	`0F E5 /r`	Yes	Yes	Yes	Yes	BW	16	No
Multiply packed 16-bit integers, store low 16 bits of results		`(V)PMULLW mm, mm/m64`	`0F D5 /r`	Yes	Yes	Yes	Yes	BW	16	No

Vector bitwise AND		`(V)PAND mm, mm/m64`	`0F DB /r`	Yes	Yes	Yes	`VPANDD`(W0)	F	32	32
Vector bitwise AND		`(V)PAND mm, mm/m64`	`0F DB /r`	Yes	Yes	Yes	`VPANDQ`(W1)	F	64	64
Vector bitwise AND-NOT		`(V)PANDN mm, mm/m64`	`0F DF /r`	Yes	Yes	Yes	`VPANDND`(W0)	F	32	32
Vector bitwise AND-NOT		`(V)PANDN mm, mm/m64`	`0F DF /r`	Yes	Yes	Yes	`VPANDNQ`(W1)	F	64	64
Vector bitwise OR		`(V)POR mm, mm/m64`	`0F EB /r`	Yes	Yes	Yes	`VPORD`(W0)	F	32	32
Vector bitwise OR		`(V)POR mm, mm/m64`	`0F EB /r`	Yes	Yes	Yes	`VPORQ`(W1)	F	64	64
Vector bitwise XOR		`(V)PXOR mm, mm/m64`	`0F EE /r`	Yes	Yes	Yes	`VPXORD`(W0)	F	32	32
Vector bitwise XOR		`(V)PXOR mm, mm/m64`	`0F EE /r`	Yes	Yes	Yes	`VPXORQ`(W1)	F	64	64

Left-shift of packed integers, with common shift-amount	16-bit	`(V)PSLLW mm, imm8`	`0F 71 /6 ib`	Yes	Yes	Yes	Yes	BW	16	No
	16-bit	`(V)PSLLW mm, mm/m64`^[k]	`0F F1 /r`	Yes	Yes	Yes	Yes	BW	16	No
	32-bit	`(V)PSLLD mm, imm8`	`0F 72 /6 ib`	Yes	Yes	Yes	Yes (W=0)	F	32	32
	32-bit	`(V)PSLLD mm, mm/m64`^[k]	`0F F2 /r`	Yes	Yes	Yes	Yes (W=0)	F	32	No
	64-bit	`(V)PSLLQ mm, imm8`	`0F 73 /6 ib`	Yes	Yes	Yes	Yes (W=1)	F	64	64
	64-bit	`(V)PSLLQ mm, mm/m64`^[k]	`0F F3 /r`	Yes	Yes	Yes	Yes (W=1)	F	64	No
Right-shift of packed signed integers, with common shift-amount	16-bit	`(V)PSRAW mm, imm8`	`0F 71 /4 ib`	Yes	Yes	Yes	Yes	BW	16	No
	16-bit	`(V)PSRAW mm, mm/m64`^[k]	`0F E1 /r`	Yes	Yes	Yes	Yes	BW	16	No
	32-bit	`(V)PSRAD mm, imm8`	`0F 72 /4 ib`	Yes	Yes	Yes	Yes (W=0)	F	32	32
	32-bit	`(V)PSRAD mm, mm/m64`^[k]	`0F E2 /r`	Yes	Yes	Yes	Yes (W=0)	F	32	No
Right-shift of packed unsigned integers, with common shift-amount	16-bit	`(V)PSRLW mm, imm8`	`0F 71 /2 ib`	Yes	Yes	Yes	Yes	BW	16	No
	16-bit	`(V)PSRLW mm, mm/m64`^[k]	`0F D1 /r`	Yes	Yes	Yes	Yes	BW	16	No
	32-bit	`(V)PSRLD mm, imm8`	`0F 72 /2 ib`	Yes	Yes	Yes	Yes (W=0)	F	32	32
	32-bit	`(V)PSRLD mm, mm/m64`^[k]	`0F D2 /r`	Yes	Yes	Yes	Yes (W=0)	F	32	No
	64-bit	`(V)PSRLQ mm, imm8`	`0F 73 /2 ib`	Yes	Yes	Yes	Yes (W=1)	F	64	64
	64-bit	`(V)PSRLQ mm, mm/m64`^[k]	`0F D3 /r`	Yes	Yes	Yes	Yes (W=1)	F	64	No

[a]
EMMS will also set the x87 top-of-stack to 0.
Unlike the older FNINIT instruction, EMMS will not update the FPU Control Word, nor will it update any part of the FPU Status Register other than the top-of-stack. If there are any unmasked pending x87 exceptions, EMMS will raise the exception while FNINIT will clear it.
[b]
The 0F 77 opcode can be VEX-encoded (resulting in the AVX VZEROUPPER and VZEROALL instructions), but this requires a VEX.NP prefix, not a VEX.66 prefix.
[c]
The 64-bit move instruction forms that are encoded by using a REX.W prefix with the 0F 6E and 0F 7E opcodes are listed with different mnemonics in Intel and AMD documentation — MOVQ in Intel documentation^[5] and MOVD in AMD documentation.^[6]
This is a documentation difference only — the operation performed by these opcodes is the same for Intel and AMD.
This documentation difference applies only to the MMX/SSE forms of these opcodes — for VEX/EVEX-encoded forms, both Intel and AMD use the mnemonic VMOVQ.)
[d]
The REX.W-encoded variants of MOVQ are available in 64-bit "long mode" only. For SSE2 and later, MOVQ to and from xmm/ymm/zmm registers can also be encoded with F3 0F 7E /r and 66 0F D6 /r respectively - these encodings are shorter and available outside 64-bit mode.
[e]
On all Intel,^[7] AMD^[8] and Zhaoxin^[9] processors that support AVX, the 128-bit forms of VMOVDQA (encoded with a VEX prefix and VEX.L=0) are, when used with a memory argument addressing WB (write-back cacheable) memory, architecturally guaranteed to perform the 128-bit memory access atomically - this applies to both load and store.
(Intel and AMD provide somewhat wider guarantees covering more 128-bit instruction variants, but Zhaoxin provides the guarantee for cacheable VMOVDQA only.)
While 128-bit VMOVDQA is atomic, it is not locked — it can be reordered in the same way as normal x86 loads/stores (e.g. loads passing older stores).
On processors that support SSE but don't support AVX, the 128-bit forms of SSE load/store instructions such as MOVAPS/MOVAPD/MOVDQA are not guaranteed to execute atomically — examples of processors where such instructions have been observed to execute non-atomically include Intel Core Duo and AMD K10.^[10]
[f]
VMOVDQA is available with a vector length of 256 bits under AVX, not requiring AVX2.
Unlike the 128-bit form, the 256-bit form of VMOVDQA does not provide any special atomicity guarantees.
[g]
For the VPACK* and VPUNPCK* instructions, encodings with a vector-length wider than 128 bits are available under AVX2 and AVX-512, but the operation of such encodings is split into 128-bit lanes where each 128-bit lane internally performs the same operation as the 128-bit variant of the instruction.
[h]
For the memory argument forms of (V)PUNPCKL* instructions, the memory argument is half-width only for the MMX variants of the instructions. For SSE/AVX/AVX-512 variants, the width of the memory argument is the full vector width even though only half of it is actually used.
[i]
The EVEX-encoded variants of the VPCMPEQ* and VPCMPGT* instructions write their results to AVX-512 opmask registers. This differs from the older non-EVEX variants, which write comparison results as vectors of all-0s/all-1s values to the regular mm/xmm/ymm vector registers.
[j]
The (V)PMADDWD instruction will add multiplication results pairwise, but will not add the sum to an accumulator. AVX512_VNNI provides the instructions VDPWSSD and WDPWSSDS, which will add multiplication results pairwise, and then also add them to a per-32-bit-lane accumulator.
[k]
For the MMX packed shift instructions PSLL* and PSR* with a shift-argument taken from a vector source (mm or m64), the shift-amount is considered to be a single 64-bit scalar value - the same shift-amount is used for all lanes of the destination vector. This shift-amount is unsigned and is not masked - all bits are considered (e.g. a shift-amount of 0x80000000_00000000 can be specified and will have the same effect as a shift-amount of 64).
For all SSE2/AVX/AVX512 extended variants of these instructions, the shift-amount vector argument is considered to be a 128-bit (xmm or m128) argument - the bottom 64 bits are used as the shift-amount.
Packed shift-instructions that can take a variable per-lane shift-amount were introduced in AVX2 for 32/64-bit lanes and AVX512BW for 16-bit lanes (VPSLLV*, VPSRLV*, VPSRAV* instructions).

MMX instructions added with MMX+/SSE/SSE2/SSSE3, and SSE2/AVX/AVX-512 extended variants thereof

More information Description, Instruction mnemonics ...

Description		Instruction mnemonics	Basic opcode	MMX (no prefix)	SSE2 (66h prefix)	AVX (VEX.66 prefix)	AVX-512 (EVEX.66 prefix)
Description		Instruction mnemonics	Basic opcode	MMX (no prefix)	SSE2 (66h prefix)	AVX (VEX.66 prefix)	supported	subset	lane	bcst

Added with SSE and MMX+
Perform shuffle of four 16-bit integers in 64-bit vector (MMX)^[a]		`PSHUFW mm,mm/m64,imm8`(MMX)	`0F 70 /r ib`	`PSHUFW`	`PSHUFD`	`VPSHUFD`	`VPSHUFD` (W=0)	F	32	32
Perform shuffle of four 32-bit integers in 128-bit vector (SSE2)		`(V)PSHUFD xmm,xmm/m128,imm8`^[b]	`0F 70 /r ib`	`PSHUFW`	`PSHUFD`	`VPSHUFD`	`VPSHUFD` (W=0)	F	32	32
Insert integer into 16-bit vector register lane		`(V)PINSRW mm,r32/m16,imm8`	`0F C4 /r ib`	Yes	Yes	Yes (L=0,W=0^[c])	Yes (L=0)	BW	No	No
Extract integer from 16-bit vector register lane, with zero-extension		`(V)PEXTRW r32,mm,imm8`^[d]	`0F C5 /r ib`	Yes	Yes	Yes (L=0,W=0^[c])	Yes (L=0)	BW	No	No
Create a bitmask made from the top bit of each byte in the source vector, and store to integer register		`(V)PMOVMSKB r32,mm`	`0F D7 /r`	Yes	Yes	Yes	No^[e]	—	—	—
Minimum-value of packed unsigned 8-bit integers		`(V)PMINUB mm,mm/m64`	`0F DA /r`	Yes	Yes	Yes	Yes	BW	8	No
Maximum-value of packed unsigned 8-bit integers		`(V)PMAXUB mm,mm/m64`	`0F DE /r`	Yes	Yes	Yes	Yes	BW	8	No
Minimum-value of packed signed 16-bit integers		`(V)PMINSW mm,mm/m64`	`0F EA /r`	Yes	Yes	Yes	Yes	BW	16	No
Maximum-value of packed signed 16-bit integers		`(V)PMAXSW mm,mm/m64`	`0F EE /r`	Yes	Yes	Yes	Yes	BW	16	No
Rounded average of packed unsigned integers. The per-lane operation is: `dst ← (src1 + src2 + 1)>>1`	8-bit	`(V)PAVGB mm,mm/m64`	`0F E0 /r`	Yes	Yes	Yes	Yes	BW	8	No
	16-bit	`(V)PAVGW mm,mm/m64`	`0F E3 /r`	Yes	Yes	Yes	Yes	BW	16	No
Multiply packed 16-bit unsigned integers, store high 16 bits of results		`(V)PMULHUW mm,mm/mm64`	`0F E4 /r`	Yes	Yes	Yes	Yes	BW	16	No
Store vector register to memory using Non-Temporal Hint. Memory operand required to be aligned for all `(V)MOVNTDQ` variants, but not for `MOVNTQ`.		`MOVNTQ m64,mm`(MMX) `(V)MOVNTDQ m128,xmm`	`0F E7 /r`	`MOVNTQ`	`MOVNTDQ`	`VMOVNTDQ`^[f]	`VMOVNTDQ` (W=0)	F	No	No
Compute sum of absolute differences for eight 8-bit unsigned integers, storing the result as a 64-bit integer. For vector widths wider than 64 bits (SSE/AVX/AVX-512), this calculation is done separately for each 64-bit lane of the vectors, producing a vector of 64-bit integers.		`(V)PSADBW mm,mm/m64`	`0F F6 /r`	Yes	Yes	Yes	Yes	BW	No	No
Unaligned store vector register to memory using byte write-mask, with Non-Temporal Hint. First argument provides data to store, second argument provides byte write-mask (top bit of each byte).^[g] Address to store to is given by DS:DI/EDI/RDI (DS: segment overridable with segment-prefix).		`MASKMOVQ mm,mm`(MMX) `(V)MASKMOVDQU xmm,xmm`	`0F F7 /r`	`MASKMOVQ`	`MASKMOVDQU`	`VMASKMOVDQU` (L=0)^[h]	No^[i]	—	—	—
Added with SSE2
Multiply packed 32-bit unsigned integers, store full 64-bit result. The input integers are taken from the low 32 bits of each 64-bit vector lane.		`(V)PMULUDQ mm,mm/m64`	`0F F4 /r`	Yes	Yes	Yes	Yes (W=1)	F	64	64
Add packed 64-bit integers		`(V)PADDQ mm, mm/m64`	`0F D4 /r`	Yes	Yes	Yes	Yes (W=1)	F	64	64
Subtract packed 64-bit integers		`(V)PSUBQ mm,mm/m64`	`0F FB /r`	Yes	Yes	Yes	Yes (W=1)	F	64	64
Added with SSSE3
Vector Byte Shuffle		`(V)PSHUFB mm,mm/m64`^[b]	`0F38 00 /r`	Yes	Yes^[j]	Yes	Yes	BW	8	No
Pairwise horizontal add of packed integers	16-bit	`(V)PHADDW mm,mm/mm64`^[b]	`0F38 01 /r`	Yes	Yes	Yes	No	—	—	—
Pairwise horizontal add of packed integers	32-bit	`(V)PHADDD mm,mm/mm64`^[b]	`0F38 02 /r`	Yes	Yes	Yes	No	—	—	—
Pairwise horizontal add of packed 16-bit signed integers, with saturation		`(V)PHADDSW mm,mm/mm64`^[b]	`0F38 03 /r`	Yes	Yes	Yes	No	—	—	—
Multiply packed 8-bit signed and unsigned integers, add results pairwise into 16-bit signed integers with saturation. First operand is treated as unsigned, second operand as signed.		`(V)PMADDUBSW mm,mm/m64`	`0F38 04 /r`	Yes	Yes	Yes	Yes	BW	16	No
Pairwise horizontal subtract of packed integers. The higher-order integer of each pair is subtracted from the lower-order integer.	16-bit	`(V)PHSUBW mm,mm/m64`^[b]	`0F38 05 /r`	Yes	Yes	Yes	No	—	—	—
	32-bit	`(V)PHSUBD mm,mm/m64`^[b]	`0F38 06 /r`	Yes	Yes	Yes	No	—	—	—
Pairwise horizontal subtract of packed 16-bit signed integers, with saturation		`(V)PHSUBSW mm,mm/m64`^[b]	`0F38 07 /r`	Yes	Yes	Yes	No	—	—	—
Modify packed integers in first source argument based on the sign of packed signed integers in second source argument. The per-lane operation performed is: if( src2 < 0 ) dst ← -src1 else if( src2 == 0 ) dst ← 0 else dst ← src1	8-bit	`(V)PSIGNB mm,mm/m64`	`0F38 08 /r`	Yes	Yes	Yes	No	—	—	—
	16-bit	`(V)PSIGNW mm,mm/m64`	`0F38 09 /r`	Yes	Yes	Yes	No	—	—	—
	32-bit	`(V)PSIGND mm,mm/m64`	`0F38 0A /r`	Yes	Yes	Yes	No	—	—	—
Multiply packed 16-bit signed integers, then perform rounding and scaling to produce a 16-bit signed integer result. The calculation performed per 16-bit lane is: `dst ← (src1*src2 + (1<<14)) >> 15`		`(V)PMULHRSW mm,mm/m64`	`0F38 0B /r`	Yes	Yes	Yes	Yes	BW	16	No
Absolute value of packed signed integers	8-bit	`(V)PABSB mm,mm/m64`	`0F38 1C /r`	Yes	Yes	Yes	Yes	BW	8	No
	16-bit	`(V)PABSW mm,mm/m64`	`0F38 1D /r`	Yes	Yes	Yes	Yes	BW	8	No
	32-bit	`(V)PABSD mm,mm/m64`	`0F38 1E /r`	`PABSD`	`PABSD`	`VPABSD`	`VPABSD`(W0)	F	32	32
	64-bit	`VPABSQ xmm,xmm/m128`(AVX-512)	`0F38 1E /r`	`PABSD`	`PABSD`	`VPABSD`	`VPABSQ`(W1)	F	64	64
Packed Align Right. Concatenate two input vectors into a double-size vector, then right-shift by the number of bytes specified by the imm8 argument. The shift-amount is not masked - if the shift-amount is greater than the input vector size, zeroes will be shifted in.		`(V)PALIGNR mm,mm/mm64,imm8`^[b]	`0F3A 0F /r ib`	Yes	Yes	Yes	Yes^[k]	BW	8	No

[a]
For shuffle of four 16-bit integers in a 64-bit section of a 128-bit XMM register, the SSE2 instructions PSHUFLW (opcode F2 0F 70 /r) or PSHUFHW (opcode F3 0F 70 /r) may be used.
[b]
For the VPSHUFD, VPSHUFB, VPHADD*, VPHSUB* and VPALIGNR instructions, encodings with a vector-length wider than 128 bits are available under AVX2 and/or AVX-512, but the operation of such encodings is split into 128-bit lanes where each 128-bit lane internally performs the same operation as the 128-bit variant of the instruction.
[c]
For the VEX-encoded forms of the VPINSRW and VPEXTRW instruction, the Intel SDM (as of rev 084) indicates that the instructions must be encoded with VEX.W=0, however neither Intel XED nor AMD APM indicate any such requirement.
[d]
The 0F C5 /r ib variant of PEXTRW allows register destination only. For SSE4.1 and later, a variant that allows a memory destination is available with the opcode 66 0F 3A 15 /r ib.
[e]
EVEX-prefixed opcode not available. Under AVX-512, a bitmask made from the top bit of each byte can instead be constructed with the VPMOVB2M instruction, with opcode EVEX.F3.0F38.W0 29 /r, which will store such a bitmask to an opmask register.
[f]
VMOVNTDQ is available with a vector length of 256 bits under AVX, not requiring AVX2.
[g]
For the MASKMOVQ and (V)MASKMOVDQU instructions, exception and trap behavior for disabled lanes is implementation-dependent. For example, a given implementation may signal a data breakpoint or a page fault for bytes that are zero-masked and not actually written.
[h]
For AVX, masked stores to memory are also available using the VMASKMOVPS instruction with opcode VEX.66.0F38 2E /r - unlike VMASKMOVDQU, this instruction allows 256-bit stores without temporal hints, although its mask is coarser - 4 bytes vs 1 byte per lane.
[i]
Opcode not available under AVX-512. Under AVX-512, unaligned masked stores to memory (albeit without temporal hints) can be done with the VMOVDQU(8|16|32|64) instructions with opcode EVEX.F2/F3.0F 7F /r, using an opmask register to provide a write mask.
[j]
For AVX2 and AVX-512 with vectors wider than 128 bits, the VPSHUFB instruction is restricted to byte-shuffle within each 128-bit lane. Instructions that can do shuffles across 128-bit lanes include e.g. AVX2's VPERMD (shuffle of 32-bit lanes across 256-bit YMM register) and AVX512_VBMI's VPERMB (full byte shuffle across 64-byte ZMM register).
[k]
For AVX-512, VPALIGNR is supported but will perform its operation within each 128-bit lane. For packed alignment shifts that can shift data across 128-bit lanes, AVX512F's VALIGND instruction may be used, although its shift-amount is specified in units of 32-bits rather than bytes.

Remove ads

SSE instructions and extended variants thereof

Summarize

Perspective

Regularly-encoded floating-point SSE/SSE2 instructions, and AVX/AVX-512 extended variants thereof

For the instructions in the below table, the following considerations apply unless otherwise noted:

Packed instructions are available at all vector lengths (128-bit for SSE2, 128/256-bit for AVX, 128/256/512-bit for AVX-512)
FP32 variants of instructions are introduced as part of SSE. FP64 variants of instructions are introduced as part of SSE2.
The AVX-512 variants of the FP32 and FP64 instructions are introduced as part of the AVX512F subset.
For AVX-512 variants of the instructions, opmasks and broadcasts are available with a width of 32 bits for FP32 operations and 64 bits for FP64 operations. (Broadcasts are available for vector operations only.)

From SSE2 onwards, some data movement/bitwise instructions exist in three forms: an integer form, an FP32 form and an FP64 form. Such instructions are functionally identical, however some processors with SSE2 will implement integer, FP32 and FP64 execution units as three different execution clusters, where forwarding of results from one cluster to another may come with performance penalties and where such penalties can be minimzed by choosing instruction forms appropriately. (For example, there exists three forms of vector bitwise XOR instructions under SSE2 - PXOR, XORPS, and XORPD - these are intended for use on integer, FP32, and FP64 data, respectively.)

More information Instruction Description, Basic opcode ...

Instruction Description	Basic opcode	Single Precision (FP32)						Double Precision (FP64)						AVX-512: RC/SAE
		Packed (no prefix)			Scalar (F3h prefix)			Packed (66h prefix)			Scalar (F2h prefix)
		SSE instruction	AVX (VEX)	AVX-512 (EVEX)	SSE instruction	AVX (VEX)^[a]	AVX-512 (EVEX)	SSE2 instruction	AVX (VEX)	AVX-512 (EVEX)	SSE2 instruction	AVX (VEX)^[a]	AVX-512 (EVEX)
Unaligned load from memory or vector register	`0F 10 /r`	`MOVUPS x,x/m128`	Yes	Yes^[b]	`MOVSS x,x/m32`	Yes	Yes	`MOVUPD x,x/m128`	Yes	Yes^[b]	`MOVSD x,x/m64`^[c]	Yes	Yes	No
Unaligned store to memory or vector register	`0F 11 /r`	`MOVUPS x/m128,x`	Yes	Yes^[b]	`MOVSS x/m32,x`	Yes	Yes	`MOVUPD x/m128,x`	Yes	Yes^[b]	`MOVSD x/m64,x`^[c]	Yes	Yes	No
Load 64 bits from memory or upper half of XMM register into the lower half of XMM register while keeping the upper half unchanged	`0F 12 /r`	`MOVHLPS x,x`	(L0)^[d]	(L0)^[d]	(MOVSLDUP)^[e]			`MOVLPD x,m64`	(L0)^[d]	(L0)^[d]	(MOVDDUP)^[e]			No
	`0F 12 /r`	`MOVLPS x,m64`	(L0)^[d]	(L0)^[d]	(MOVSLDUP)^[e]			`MOVLPD x,m64`	(L0)^[d]	(L0)^[d]	(MOVDDUP)^[e]			No
Store 64 bits to memory from lower half of XMM register	`0F 13 /r`	`MOVLPS m64,x`	(L0)^[d]	(L0)^[d]	No	No	No	`MOVLPD m64,x`	(L0)^[d]	(L0)^[d]	No	No	No	No
Unpack and interleave low-order floating-point values	`0F 14 /r`	`UNPCKLPS x,x/m128`	Yes^[f]	Yes^[f]	No	No	No	`UNPCKLPD x,x/m128`	Yes^[f]	Yes^[f]	No	No	No	No
Unpack and interleave high-order floating-point values	`0F 15 /r`	`UNPCKHPS x,x/m128`	Yes^[f]	Yes^[f]	No	No	No	`UNPCKHPD x,x/m128`	Yes^[f]	Yes^[f]	No	No	No	No
Load 64 bits from memory or lower half of XMM register into the upper half of XMM register while keeping the lower half unchanged	`0F 16 /r`	`MOVLHPS x,x`	(L0)^[d]	(L0)^[d]	(MOVSHDUP)^[e]			`MOVHPD x,m64`	(L0)^[d]	(L0)^[d]	No	No	No	No
	`0F 16 /r`	`MOVHPS x,m64`	(L0)^[d]	(L0)^[d]	(MOVSHDUP)^[e]			`MOVHPD x,m64`	(L0)^[d]	(L0)^[d]	No	No	No	No
Store 64 bits to memory from upper half of XMM register	`0F 17 /r`	`MOVHPS m64,x`	(L0)^[d]	(L0)^[d]	No	No	No	`MOVHPD m64,x`	(L0)^[d]	(L0)^[d]	No	No	No	No

Aligned load from memory or vector register	`0F 28 /r`	`MOVAPS x,x/m128`	Yes	Yes^[b]	No	No	No	`MOVAPD x,x/m128`	Yes	Yes^[b]	No	No	No	No
Aligned store to memory or vector register	`0F 29 /r`	`MOVAPS x/m128,x`	Yes	Yes^[b]	No	No	No	`MOVAPD x/m128,x`	Yes	Yes^[b]	No	No	No	No
Integer to floating-point conversion using general-registers, MMX-registers or memory as source	`0F 2A /r`	`CVTPI2PS x,mm/m64`^[g]	No	No	`CVTSI2SS x,r/m32` `CVTSI2SS x,r/m64` ^[h]	Yes	Yes^[i]	`CVTPI2PD x,mm/m64`^[g]	No	No	`CVTSI2SD x,r/m32` `CVTSI2SD x,r/m64` ^[h]	Yes	Yes^[i]	RC
Non-temporal store to memory from vector register. The packed variants require aligned memory addresses even in VEX/EVEX-encoded forms.	`0F 2B /r`	`MOVNTPS m128,x`	Yes	Yes^[i]	`MOVNTSS m32,x` (AMD SSE4a)	No	No	`MOVNTPD m128,x`	Yes	Yes^[i]	`MOVNTSD m64,x` (AMD SSE4a)	No	No	No
Floating-point to integer conversion with truncation, using general-purpose registers or MMX-registers as destination	`0F 2C /r`	`CVTTPS2PI mm,x/m64`^[j]	No	No	`CVTTSS2SI r32,x/m32` `CVTTSS2SI r64,x/m32`^[k]	Yes	Yes^[i]	`CVTTPD2PI mm,x/m64`^[j]	No	No	`CVTTSD2SI r32,x/m64` `CVTTSD2SI r64,x/m64`^[k]	Yes	Yes^[i]	SAE
Floating-point to integer conversion, using general-purpose registers or MMX-registers as destination	`0F 2D /r`	`CVTPS2PI mm,x/m64`^[j]	No	No	`CVTSS2SI r32,x/m32` `CVTSS2SI r64,x/m32`^[k]	Yes	Yes^[i]	`CVTPD2PI mm,x/m64`^[j]	No	No	`CVTSD2SI r32,x/m64` `CVTSD2SI r64,x/m64`^[k]	Yes	Yes^[i]	RC
Unordered compare floating-point values and set EFLAGS. Compares the bottom lanes of xmm vector registers.	`0F 2E /r`	`UCOMISS x,x/m32`	Yes^[a]	Yes^[i]	No	No	No	`UCOMISD x,x/m64`	Yes^[a]	Yes^[i]	No	No	No	SAE
Compare floating-point values and set EFLAGS. Compares the bottom lanes of xmm vector registers.	`0F 2F /r`	`COMISS x,x/m32`	Yes^[a]	Yes^[i]	No	No	No	`COMISD x,x/m64`	Yes^[a]	Yes^[i]	No	No	No	SAE

Extract packed floating-point sign mask	`0F 50 /r`	`MOVMSKPS r32,x`	Yes	No^[l]	No	No	No	`MOVMSKPD r32,x`	Yes	No^[l]	No	No	No	—
Floating-point Square Root	`0F 51 /r`	`SQRTPS x,x/m128`	Yes	Yes	`SQRTSS x,x/m32`	Yes	Yes	`SQRTPD x,x/m128`	Yes	Yes	`SQRTSD x,x/m64`	Yes	Yes	RC
Reciprocal Square Root Approximation^[m]	`0F 52 /r`	`RSQRTPS x,x/m128`	Yes	No^[n]	`RSQRTSS x,x/m32`	Yes	No^[n]	No	No	No^[n]	No	No	No^[n]	—
Reciprocal Approximation^[m]	`0F 53 /r`	`RCPPS x,x/m128`	Yes	No^[o]	`RCPSS x,x/m32`	Yes	No^[o]	No	No	No^[o]	No	No	No^[o]	—
Vector bitwise AND	`0F 54 /r`	`ANDPS x,x/m128`	Yes	(DQ)^[p]	No	No	No	`ANDPD x,x/m128`	Yes	(DQ)^[p]	No	No	No	No
Vector bitwise AND-NOT	`0F 55 /r`	`ANDNPS x,x/m128`	Yes	(DQ)^[p]	No	No	No	`ANDNPD x,x/m128`	Yes	(DQ)^[p]	No	No	No	No
Vector bitwise OR	`0F 56 /r`	`ORPS x,x/m128`	Yes	(DQ)^[p]	No	No	No	`ORPD x,x/m128`	Yes	(DQ)^[p]	No	No	No	No
Vector bitwise XOR^[q]	`0F 57 /r`	`XORPS x,x/m128`	Yes	(DQ)^[p]	No	No	No	`XORPD x,x/m128`	Yes	(DQ)^[p]	No	No	No	No

Floating-point Add	`0F 58 /r`	`ADDPS x,x/m128`	Yes	Yes	`ADDSS x,x/m32`	Yes	Yes	`ADDPD x,x/m128`	Yes	Yes	`ADDSD x,x/m64`	Yes	Yes	RC
Floating-point Multiply	`0F 59 /r`	`MULPS x,x/m128`	Yes	Yes	`MULSS x,x/m32`	Yes	Yes	`MULPD x,x/m128`	Yes	Yes	`MULSD x,x/m64`	Yes	Yes	RC
Convert between floating-point formats (FP32→FP64, FP64→FP32)	`0F 5A /r`	`CVTPS2PD x,x/m64` (SSE2)	Yes	Yes^[r]	`CVTSS2SD x,x/m32` (SSE2)	Yes	Yes^[r]	`CVTPD2PS x,x/m128`	Yes	Yes^[r]	`CVTSD2SS x,x/m64`	Yes	Yes^[r]	SAE, RC^[s]
Floating-point Subtract	`0F 5C /r`	`SUBPS x,x/m128`	Yes	Yes	`SUBSS x,x/m32`	Yes	Yes	`SUBPD x,x/m128`	Yes	Yes	`SUBSD x,x/m64`	Yes	Yes	RC
Floating-point Minimum Value^[t]	`0F 5D /r`	`MINPS x,x/m128`	Yes	Yes	`MINSS x,x/m32`	Yes	Yes	`MINPD x,x/m128`	Yes	Yes	`MINSD x,x/m64`	Yes	Yes	SAE
Floating-point Divide	`0F 5E /r`	`DIVPS x,x/m128`	Yes	Yes	`DIVSS x,x/m32`	Yes	Yes	`DIVPD x,x/m128`	Yes	Yes	`DIVSD x,x/m64`	Yes	Yes	RC
Floating-point Maximum Value^[t]	`0F 5F /r`	`MAXPS x,x/m128`	Yes	Yes	`MAXSS x,x/m32`	Yes	Yes	`MAXPD x,x/m128`	Yes	Yes	`MAXSD x,x/m64`	Yes	Yes	SAE

Floating-point compare. Result is written as all-0s/all-1s values (all-1s for comparison true) to vector registers for SSE/AVX, but opmask register for AVX-512. Comparison function is specified by imm8 argument.^[u]	`0F C2 /r ib`	`CMPPS x,x/m128,imm8`	Yes	Yes	`CMPSS x,x/m32,imm8`	Yes	Yes	`CMPPD x,x/m128,imm8`	Yes	Yes	`CMPSD x,x/m64,imm8` ^[c]	Yes	Yes	SAE
Packed Interleaved Shuffle. Performs a shuffle on each of its two input arguments, then keeps the bottom half of the shuffle result from its first argument and the top half of the shuffle result from its second argument.	`0F C6 /r ib`	`SHUFPS x,x/m128,imm8`^[f]	Yes	Yes	No	No	No	`SHUFPD x,x/m128,imm8`^[f]	Yes	Yes	No	No	No	No

[a]
The VEX-prefix-encoded variants of the scalar instructions listed in this table should be encoded with VEX.L=0. Setting VEX.L=1 for any of these instructions is allowed but will result in what the Intel SDM describes as "unpredictable behavior across different processor generations". This also applies to VEX-encoded variants of V(U)COMISS and V(U)COMISD. (This behavior does not apply to scalar instructions outside this table, such as e.g. VMOVD/VMOVQ, where VEX.L=1 results in an #UD exception.)
[b]
EVEX-encoded variants of VMOVAPS, VMOVUPS, VMOVAPD and VMOVUPD support opmasks but do not support broadcast.
[c]
The SSE2 MOVSD (MOVe Scalar Double-precision) and CMPSD (CoMPare Scalar Double-precision) instructions have the same names as the older i386 MOVSD (MOVe String Doubleword) and CMPSD (CoMPare String Doubleword) instructions, however their operations are completely unrelated.
At the assembly language level, they can be distinguished by their use of XMM register operands.
[d]
For variants of VMOVLPS, VMOVHPS, VMOVLPD, VMOVHPD, VMOVLHPS, VMOVHLPS encoded with VEX or EVEX prefixes, the only supported vector length is 128 bits (VEX.L=0 or EVEX.L=0).
For the EVEX-encoded variants, broadcasts and opmasks are not supported.
[e]
The MOVSLDUP, MOVSHDUP and MOVDDUP instructions are not regularly-encoded scalar SSE1/2 instructions, but instead irregularly-assigned SSE3 vector instructions. For a description of these instructions, see table below.
[f]
For the VUNPCK*, VSHUFPS and VSHUFPD instructions, encodings with a vector-length wider than 128 bits are available under AVX2 and AVX-512, but the operation of such encodings is split into 128-bit lanes where each 128-bit lane internally performs the same operation as the 128-bit variant of the instruction (except that for VSHUFPD, each 128-bit lane will use a different 2-bit part of the instruction's imm8 argument).
[g]
The CVTPI2PS and CVTPI2PD instructions take their input data as a vector of two 32-bit signed integers from either memory or MMX register. They will cause an x87→MMX transition even if the source operand is a memory operand.
For vector int→FP conversions that can accept an xmm/ymm/zmm register or vectors wider than 64 bits as input arguments, SSE2 provides the following irregularly-assigned instructions (see table below):
CVTDQ2PS (0F 5B /r)
CVTDQ2PD (F3 0F E6 /r)
These exist in AVX/AVX-512 extended forms as well.
[h]
For the (V)CVTSI2SS and (V)CVTSI2SD instructions, variants with a 64-bit source argument are only available in 64-bit long mode and require REX.W, VEX.W or EVEX.W to be set to 1.
In 32-bit mode, their source argument is always 32-bit even if VEX.W or EVEX.W is set to 1.
[i]
EVEX-encoded variants of
VMOVNTPS, VMOVNTSS
VCOMISS, VCOMISD, VUCOMISS, VUCOMISD
VCVTSI2SS, VCTSI2SD
VCVT(T)SS2SI,VCVT(T)SD2SI
support neither opmasks nor broadcast.
[j]
The CVT(T)PS2PI and CVT(T)PD2PI instructions write their result to MMX register as a vector of two 32-bit signed integers.
For vector FP→int conversions that can write results to xmm/ymm/zmm registers, SSE2 provides the following irregularly-assigned instructions (see table below):
CVTPS2DQ (66 0F 5B /r)
CVTTPS2DQ (F3 0F 5B /r)
CVTPD2DQ (F2 0F E6 /r)
CVTTPD2DQ (66 0F E6 /r)
These exist in AVX/AVX-512 extended forms as well.
[k]
For the (V)CVT(T)SS2SI and (V)CVT(T)SD2SI instructions, variants with a 64-bit destination register are only available in 64-bit long mode and require REX.W, VEX.W or EVEX.W to be set to 1.
In 32-bit mode, their destination register is always 32-bit even if VEX.W or EVEX.W is set to 1.
[l]
This instruction cannot be EVEX-encoded. Under AVX512DQ, extracting packed floating-point sign-bits can instead be done with the VPMOVD2M and VPMOVQ2M instructions.
[m]
The (V)RCPSS, (V)RCPPS, (V)RSQRTSS and (V)RSQRTPS approximation instructions compute their result with a relative error of at most $\pm 1.5*2^{-12}$ . The exact calculation is implementation-specific and known to vary between different x86 CPUs.^[11]
[n]
This instruction cannot be EVEX-encoded. Instead, AVX512F provides different opcodes - EVEX.66.0F38 4E/4F /r - for its new VRSQRT14* reciprocal square root approximation instructions.
The main difference between the AVX-512 VRSQRT14* instructions and the older SSE/AVX (V)RSQRT* instructions is that the AVX-512 VRSQRT14* instructions have their operation defined in a bit-exact manner, with a C reference model provided by Intel.^[12]
[o]
This instruction cannot be EVEX-encoded. Instead, AVX512F provides different opcodes - EVEX.66.0F38 4C/4D /r - for its new VRCP14* reciprocal approximation instructions.
The main difference between the AVX-512 VRCP14* instructions and the older SSE/AVX (V)RCP* instructions is that the AVX-512 VRRCP14* instructions have their operation defined in a bit-exact manner, with a C reference model provided by Intel.^[12]
[p]
The EVEX-encoded versions of the VANDPS, VANDPD, VANDNPS, VANDNPD, VORPS, VORPD, VXORPS, VXORPD instructions are not introduced as part of the AVX512F subset, but instead the AVX512DQ subset.
[q]
XORPS/VXORPS with both source operands being the same register is commonly used as a register-zeroing idiom, and is recognized by most x86 CPUs as an instruction that does not depend on its source arguments.
Under AVX or AVX-512, it is recommended to use a 128-bit form of VXORPS for this purpose - this will, on some CPUs, result in fewer micro-ops than wider forms while still achieving register-zeroing of the whole 256 or 512 bit vector-register.^[13]
[r]
For EVEX-encoded variants of conversions between FP formats of different widths, the opmask lane width is determined by the result format: 64-bit for VCVTPS2PD and VCVTSS2SD and 32-bit for VCVTPD2PS and VCVTSS2SD.
[s]
Widening FP→FP conversions (CVTPS2PD, CVTSS2SD, VCVTPH2PD, VCVTSH2SD) support the SAE modifier. Narrowing conversions (CVTPD2PS, CVTSD2SS) support the RC modifier.
[t]
For the floating-point minimum-value and maximum-value instructions (V)MIN* and (V)MAX*, if the two input operands are both zero or at least one of the input operands is NaN, then the second input operand is returned. This matches the behavior of common C programming-language expressions such as ((op1)>(op2)?(op1):(op2)) for maximum-value and ((op1)<(op2)?(op1):(op2)) for minimum-value.
[u]
For the SIMD floating-point compares, the imm8 argument has the following format:
More information Bits, Usage ...

Bits Usage

1:0 Basic comparison predicate

2 Invert comparison result

3 Invert comparison result if unordered (VEX/EVEX only)

4 Invert signalling behavior (VEX/EVEX only)

Close
The basic comparison predicates are:
More information Value, Meaning ...

Value Meaning

00b Equal (non-signalling)

01b Less-than (signalling)

10b Less-than-or-equal (signalling)

11b Unordered (non-signalling)

Close
A signalling compare will cause an exception if any of the inputs are QNaN.

Bits	Usage
1:0	Basic comparison predicate
2	Invert comparison result
3	Invert comparison result if unordered (VEX/EVEX only)
4	Invert signalling behavior (VEX/EVEX only)

Value	Meaning
00b	Equal (non-signalling)
01b	Less-than (signalling)
10b	Less-than-or-equal (signalling)
11b	Unordered (non-signalling)

Integer SSE2/4 instructions with 66h prefix, and AVX/AVX-512 extended variants thereof

These instructions do not have any MMX forms, and do not support any encodings without a prefix. Most of these instructions have extended variants available in VEX-encoded and EVEX-encoded forms:

The VEX-encoded forms are available under AVX/AVX2. Under AVX, they are available only with a vector length of 128 bits (VEX.L=0 enocding) - under AVX2, they are (with some exceptions noted with "L=0") also made available with a vector length of 256 bits.
The EVEX-encoded forms are available under AVX-512 - the specific AVX-512 subset needed for each instruction is listed along with the instruction.

More information Description, Instruction mnemonics ...

Description		Instruction mnemonics	Basic opcode	SSE (66h prefix)	AVX (VEX.66 prefix)	AVX-512 (EVEX.66 prefix)
Description		Instruction mnemonics	Basic opcode	SSE (66h prefix)	AVX (VEX.66 prefix)	supported	subset	lane	bcst

Added with SSE2
Unpack and interleave low-order 64-bit integers		`(V)PUNPCKLQDQ xmm,xmm/m128`^[a]	`0F 6C /r`	Yes	Yes	Yes (W=1)	F	64	64
Unpack and interleave high-order 64-bit integers		`(V)PUNPCKHQDQ xmm,xmm/m128`^[a]	`0F 6D /r`	Yes	Yes	Yes (W=1)	F	64	64
Right-shift 128-bit unsigned integer by specified number of bytes		`(V)PSRLDQ xmm,imm8`^[a]	`0F 73 /3 ib`	Yes	Yes	Yes	BW	No	No
Left-shift 128-bit integer by specified number of bytes		`(V)PSLLDQ xmm,imm8`^[a]	`0F 73 /7 ib`	Yes	Yes	Yes	BW	No	No
Move 64-bit scalar value from xmm register to xmm register or memory		`(V)MOVQ xmm/m64,xmm`	`0F D6 /r`	Yes	Yes (L=0)	Yes (L=0,W=1)	F	No	No
Added with SSE4.1
Variable blend packed bytes. For each byte lane of the result, pick the value from either the first or the second argument depending on the top bit of the corresponding byte lane of `XMM0`.		`PBLENDVB xmm,xmm/m128` `PBLENDVB xmm,xmm/m128,XMM0`^[b]	`0F38 10 /r`	Yes	No^[c]	No^[d]	—	—	—
Sign-extend packed integers into wider packed integers	8-bit → 16-bit	`(V)PMOVSXBW xmm,xmm/m64`	`0F38 20 /r`	Yes	Yes	Yes	BW	16	No
	8-bit → 32-bit	`(V)PMOVSXBD xmm,xmm/m32`	`0F38 21 /r`	Yes	Yes	Yes	F	32	No
	8-bit → 64-bit	`(V)PMOVSXBQ xmm,xmm/m16`	`0F38 22 /r`	Yes	Yes	Yes	F	64	No
	16-bit → 32-bit	`(V)PMOVSXWD xmm,xmm/m64`	`0F38 23 /r`	Yes	Yes	Yes	F	32	No
	16-bit → 64-bit	`(V)PMOVSXWQ xmm,xmm/m32`	`0F38 24 /r`	Yes	Yes	Yes	F	64	No
	32-bit → 64-bit	`(V)PMOVSXDQ xmm,xmm/m64`	`0F38 25 /r`	Yes	Yes	Yes (W=0)	F	64	No

Multiply packed 32-bit signed integers, store full 64-bit result. The input integers are taken from the low 32 bits of each 64-bit vector lane.		`(V)PMULDQ xmm,xmm/m128`	`0F38 28 /r`	Yes	Yes	Yes (W=1)	F	64	64
Compare packed 64-bit integers for equality		`(V)PCMPEQQ xmm,xmm/m128`	`0F38 29 /r`	Yes	Yes	Yes (W=1)^[e]	F	64	64
Aligned non-temporal vector load from memory.^[f]		`(V)MOVNTDQA xmm,m128`	`0F38 2A /r`	Yes	Yes	Yes (W=0)	F	No	No
Pack 32-bit unsigned integers to 16-bit, with saturation		`(V)PACKUSDW xmm, xmm/m128`^[a]	`0F38 2B /r`	Yes	Yes	Yes (W=0)	BW	16	32

Zero-extend packed integers into wider packed integers	8-bit → 16-bit	`(V)PMOVZXBW xmm,xmm/m64`	`0F38 30 /r`	Yes	Yes	Yes	BW	16	No
	8-bit → 32-bit	`(V)PMOVZXBD xmm,xmm/m32`	`0F38 31 /r`	Yes	Yes	Yes	F	32	No
	8-bit → 64-bit	`(V)PMOVZXBQ xmm,xmm/m16`	`0F38 32 /r`	Yes	Yes	Yes	F	64	No
	16-bit → 32-bit	`(V)PMOVZXWD xmm,xmm/m64`	`0F38 33 /r`	Yes	Yes	Yes	F	32	No
	16-bit → 64-bit	`(V)PMOVZXWQ xmm,xmm/m32`	`0F38 34 /r`	Yes	Yes	Yes	F	64	No
	32-bit → 64-bit	`(V)PMOVZXDQ xmm,xmm/m64`	`0F38 35 /r`	Yes	Yes	Yes (W=0)	F	64	No

Packed minimum-value of signed integers	8-bit	`(V)PMINSB xmm,xmm/m128`	`0F38 38 /r`	Yes	Yes	Yes	BW	8	No
	32-bit	`(V)PMINSD xmm,xmm/m128`	`0F38 39 /r`	`PMINSD`	`VPMINSD`	`VPMINSD`(W0)	F	32	32
	64-bit	`VPMINSQ xmm,xmm/m128`(AVX-512)	`0F38 39 /r`	`PMINSD`	`VPMINSD`	`VPMINSQ`(W1)	F	64	64
Packed minimum-value of unsigned integers	16-bit	`(V)PMINUW xmm,xmm/m128`	`0F38 3A /r`	Yes	Yes	Yes	BW	16	No
	32-bit	`(V)PMINUD xmm,xmm/m128`	`0F38 3B /r`	`PMINUD`	`VPMINUD`	`VPMINUD`(W0)	F	32	32
	64-bit	`VPMINUQ xmm,xmm/m128`(AVX-512)	`0F38 3B /r`	`PMINUD`	`VPMINUD`	`VPMINUQ`(W1)	F	64	64
Packed maximum-value of signed integers	8-bit	`(V)PMAXSB xmm,xmm/m128`	`0F38 3C /r`	Yes	Yes	Yes	BW	8	No
	32-bit	`(V)PMAXSD xmm,xmm/m128`	`0F38 3D /r`	`PMAXSD`	`VPMAXSD`	`VPMAXSD`(W0)	F	32	32
	64-bit	`VPMAXSQ xmm,xmm/m128`(AVX-512)	`0F38 3D /r`	`PMAXSD`	`VPMAXSD`	`VPMAXSQ`(W1)	F	64	64
Packed maximum-value of unsigned integers	16-bit	`(V)PMAXUW xmm,xmm/m128`	`0F38 3E /r`	Yes	Yes	Yes	BW	16	No
	32-bit	`(V)PMAXUD xmm,xmm/m128`	`0F38 3F /r`	`PMAXUD`	`VPMAXUD`	`VPMAXUD`(W0)	F	32	32
	64-bit	`VPMAXUQ xmm,xmm/m128`(AVX-512)	`0F38 3F /r`	`PMAXUD`	`VPMAXUD`	`VPMAXUQ`(W1)	F	64	64

Multiply packed 32/64-bit integers, store low half of results		`(V)PMULLD mm,mm/m64` `PMULLQ xmm,xmm/m128`(AVX-512)	`0F38 40 /r`	`PMULLD`	`VPMULLD`	`VPMULLD`(W0)	F	32	32
		`(V)PMULLD mm,mm/m64` `PMULLQ xmm,xmm/m128`(AVX-512)	`0F38 40 /r`	`PMULLD`	`VPMULLD`	`VPMULLQ`(W1)	DQ	64	64
Packed Horizontal Word Minimum Find the smallest 16-bit integer in a packed vector of 16-bit unsigned integers, then return the integer and its index in the bottom two 16-bit lanes of the result vector.		`(V)PHMINPOSUW xmm,xmm/m128`	`0F38 41 /r`	Yes	Yes (L=0)	No	—	—	—
Blend Packed Words. For each 16-bit lane of the result, pick a 16-bit value from either the first or the second source argument depending on the corresponding bit of the imm8.		`(V)PBLENDW xmm,xmm/m128,imm8`^[a]	`0F3A 0E /r ib`	Yes	Yes^[g]	No^[h]	—	—	—

Extract integer from indexed lane of vector register, and store to GPR or memory. Zero-extended if stored to GPR.	8-bit	`(V)PEXTRB r32/m8,xmm,imm8`^[i]	`0F3A 14 /r ib`	Yes	Yes (L=0)	Yes (L=0)	BW	No	No
	16-bit	`(V)PEXTRW r32/m16,xmm,imm8`^[i]	`0F3A 15 /r ib`	Yes	Yes (L=0)	Yes (L=0)	BW	No	No
	32-bit	`(V)PEXTRD r/m32,xmm,imm8`	`0F3A 16 /r ib`	Yes	Yes (L=0,W=0)^[j]	Yes (L=0,W=0)	DQ	No	No
	64-bit (x86-64)	`(V)PEXTRQ r/m64,xmm,imm8`	`0F3A 16 /r ib`	Yes (REX.W)	Yes (L=0,W=1)	Yes (L=0,W=1)	DQ	No	No
Insert integer from general-purpose register into indexed lane of vector register	8-bit	`(V)PINSRB xmm,r32/m8,imm8`^[k]	`0F3A 20 /r ib`	Yes	Yes (L=0)	Yes (L=0)	BW	No	No
	32-bit	`(V)PINSRD xmm,r32/m32,imm8`	`0F3A 22 /r ib`	Yes	Yes (L=0,W=0)^[j]	Yes (L=0,W=0)	DQ	No	No
	64-bit (x86-64)	`(V)PINSRQ xmm,r64/m64,imm8`	`0F3A 22 /r ib`	Yes (REX.W)	Yes (L=0,W=1)	Yes (L=0,W=1)	DQ	No	No

Compute Multiple Packed Sums of Absolute Difference. The 128-bit form of this instruction computes 8 sums of absolute differences from sequentially selected groups of four bytes in the first source argument and a selected group of four contiguous bytes in the second source operand, and writes the sums to sequential 16-bit lanes of destination register. If the two source arguments `src1` and `src2` are considered to be two 16-entry arrays of uint8 values and `temp` is considered to be an 8-entry array of uint16 values, then the operation of the instruction is: for i = 0 to 7 do temp[i] := 0 for j = 0 to 3 do a := src1[ i+(imm8[2]4)+j ] b := src2[ (imm8[1:0]4)+j ] temp[i] := temp[i] + abs(a-b) done done dst := temp For wider forms of this instruction under AVX2 and AVX10.2, the operation is split into 128-bit lanes where each lane internally performs the same operation as the 128-bit variant of the instruction - except that odd-numbered lanes use bits 5:3 rather than bits 2:0 of the imm8.		`(V)MPSADBW xmm,xmm/m128,imm8`	`0F3A 42 /r ib`	Yes	Yes	Yes (W=0)	10.2^[l]	16	No
Added with SSE 4.2
Compare packed 64-bit signed integers for greater-than		`(V)PCMPGTQ xmm, xmm/m128`	`0F38 37 /r`	Yes	Yes	Yes (W=1)^[e]	F	64	64
Packed Compare Explicit Length Strings, Return Mask		`(V)PCMPESTRM xmm,xmm/m128,imm8`	`0F3A 60 /r ib`	Yes^[m]	Yes (L=0)	No	—	—	—
Packed Compare Explicit Length Strings, Return Index		`(V)PCMPESTRI xmm,xmm/m128,imm8`	`0F3A 61 /r ib`	Yes^[m]	Yes (L=0)	No	—	—	—
Packed Compare Implicit Length Strings, Return Mask		`(V)PCMPISTRM xmm,xmm/m128,imm8`	`0F3A 62 /r ib`	Yes^[m]	Yes (L=0)	No	—	—	—
Packed Compare Implicit Length Strings, Return Index		`(V)PCMPISTRI xmm,xmm/m128,imm8`	`0F3A 63 /r ib`	Yes^[m]	Yes (L=0)	No	—	—	—

[a]
For the (V)PUNPCK*, (V)PACKUSDW, (V)PBLENDW, (V)PSLLDQ and (V)PSLRDQ instructions, encodings with a vector-length wider than 128 bits are available under AVX2 and/or AVX-512, but the operation of such encodings is split into 128-bit lanes where each 128-bit lane internally performs the same operation as the 128-bit variant of the instruction.
[b]
Assemblers may accept PBLENDVB with or without XMM0 as a third argument.
[c]
The PBLENDVB instruction with opcode 66 0F38 10 /r is not VEX-encodable. AVX does provide a VPBLENDVB instruction that is similar to PBLENDVB, however, it uses a different opcode and operand encoding - VEX.66.0F3A.W0 4C /r /is4.
[d]
Opcode not EVEX-encodable. Under AVX-512, variable blend of packed bytes may be done with the VPBLENDMB instruction (opcode EVEX.66.0F38.W0 66 /r).
[e]
The EVEX-encoded variants of the VPCMPEQ* and VPCMPGT* instructions write their results to AVX-512 opmask registers. This differs from the older non-EVEX variants, which write comparison results as vectors of all-0s/all-1s values to the regular mm/xmm/ymm vector registers.
[f]
The load performed by (V)MOVNTDQA is weakly-ordered. It may be reordered with respect to other loads, stores and even LOCKs - to impose ordering with respect to other loads/stores, MFENCE or serialization is needed.
If (V)MOVNTDQA is used with uncached memory, it may fetch a cache-line-sized block of data around the data actually requested - subsequent (V)MOVNTDQA instructions may return data from blocks fetched in this manner as long as they are not separated by an MFENCE or serialization.
[g]
For AVX, the VBLENDPS and VPBLENDD instructions can be used to perform a blend with 32-bit lanes, allowing one imm8 mask to span a full 256-bit vector without repetition.
[h]
Opcode not EVEX-encodable. Under AVX-512, variable blend of packed words may be done with the VPBLENDMW instruction (opcode EVEX.66.0F38.W1 66 /r).
[i]
For (V)PEXTRB and (V)PEXTRW, if the destination argument is a register, then the extracted 8/16-bit value is zero-extended to 32/64 bits.
[j]
For the VPEXTRD and VPINSRD instructions in non-64-bit mode, the instructions are documented as being permitted to be encoded with VEX.W=1 on Intel^[14] but not AMD^[15] CPUs (although exceptions to this do exist, e.g. Bulldozer permits such encodings^[16] while Sandy Bridge does not^[17])
In 64-bit mode, these instructions require VEX.W=0 on both Intel and AMD processors — encodings with VEX.W=1 are interpreted as VPEXTRQ/VPINSRQ.
[k]
In the case of a register source argument to (V)PINSRB, the argument is considered to be a 32-bit register of which the 8 bottom bits are used, not an 8-bit register proper. This means that it is not possible to specify AH/BH/CH/DH as a source argument to (V)PINSRB.
[l]
EVEX-encoded variants of the VMPSADBW instruction are only available if AVX10.2 is supported.
[m]
The SSE4.2 packed string compare PCMP*STR* instructions allow their 16-byte memory operands to be misaligned even when using legacy SSE encoding.

Other SSE/2/3/4 SIMD instructions, and AVX/AVX-512 extended variants thereof

SSE SIMD instructions that do not fit into any of the preceding groups. Many of these instructions have AVX/AVX-512 extended forms - unless otherwise indicated (L=0 or footnotes) these extended forms support 128/256-bit operation under AVX and 128/256/512-bit operation under AVX-512.

More information Description, Instruction mnemonics ...

Description		Instruction mnemonics	Basic opcode	SSE	AVX (VEX prefix)	AVX-512 (EVEX prefix)
Description		Instruction mnemonics	Basic opcode	SSE	AVX (VEX prefix)	supported	subset	lane	bcst	rc/sae

Added with SSE
Load MXCSR (Media eXtension Control and Status Register) from memory		`(V)LDMXCSR m32`	`NP 0F AE /2`	Yes	Yes (L=0)	No	—	—	—	—
Store MXCSR to memory		`(V)STMXCSR m32`	`NP 0F AE /3`	Yes	Yes (L=0)	No	—	—	—	—
Added with SSE2
Move a 64-bit data item from MMX register to bottom half of XMM register. Top half is zeroed out.		`MOVQ2DQ xmm,mm`	`F3 0F D6 /r`	Yes	No	No	—	—	—	—
Move a 64-bit data item from bottom half of XMM register to MMX register.		`MOVDQ2Q mm,xmm`	`F2 0F D6 /r`	Yes	No	No	—	—	—	—
Load a 64-bit integer from memory or XMM register to bottom 64 bits of XMM register, with zero-fill		`(V)MOVQ xmm,xmm/m64`	`F3 0F 7E /r`	Yes	Yes (L=0)	Yes (L=0,W=1)	F	No	No	No
Vector load from unaligned memory or vector register		`(V)MOVDQU xmm,xmm/m128`	`F3 0F 6F /r`	Yes	Yes	`VMOVDQU64`(W1)	F	64	No	No
			`F3 0F 6F /r`	Yes	Yes	`VMOVDQU32`(W0)	F	32	No	No
			`F2 0F 6F /r`	No	No	`VMOVDQU16`(W1)	BW	16	No	No
			`F2 0F 6F /r`	No	No	`VMOVDQU8`(W0)	BW	8	No	No
Vector store to unaligned memory or vector register		`(V)MOVDQU xmm/m128,xmm`	`F3 0F 7F /r`	Yes	Yes	`VMOVDQU64`(W1)	F	64	No	No
			`F3 0F 7F /r`	Yes	Yes	`VMOVDQU32`(W0)	F	32	No	No
			`F2 0F 7F /r`	No	No	`VMOVDQU16`(W1)	BW	16	No	No
			`F2 0F 7F /r`	No	No	`VMOVDQU8`(W0)	BW	8	No	No
Shuffle the four top 16-bit lanes of source vector, then place result in top half of destination vector		`(V)PSHUFHW xmm,xmm/m128,imm8`^[a]	`F3 0F 70 /r ib`	Yes	Yes^[b]	Yes	BW	16	No	No
Shuffle the four bottom 16-bit lanes of source vector, then place result in bottom half of destination vector		`(V)PSHUFLW xmm,xmm/m128,imm8`^[a]	`F2 0F 70 /r ib`	Yes	Yes^[b]	Yes	BW	16	No	No
Convert packed signed 32-bit integers to FP32		`(V)CVTDQ2PS xmm,xmm/m128`	`NP 0F 5B /r`	Yes	Yes	Yes (W=0)	F	32	32	RC
Convert packed FP32 values to packed signed 32-bit integers		`(V)CVTPS2DQ xmm,xmm/m128`	`66 0F 5B /r`	Yes	Yes	Yes (W=0)	F	32	32	RC
Convert packed FP32 values to packed signed 32-bit integers, with round-to-zero		`(V)CVTTPS2DQ xmm,xmm/m128`	`F3 0F 5B /r`	Yes	Yes	Yes (W=0)	F	32	32	SAE
Convert packed FP64 values to packed signed 32-bit integers, with round-to-zero		`(V)CVTTPD2DQ xmm,xmm/m64`	`66 0F E6 /r`	Yes	Yes	Yes (W=1)	F	32	64	SAE
Convert packed signed 32-bit integers to FP64		`(V)CVTDQ2PD xmm,xmm/m64`	`F3 0F E6 /r`	Yes	Yes	Yes (W=0)	F	64	32	RC^[c]
Convert packed FP64 values to packed signed 32-bit integers		`(V)CVTPD2DQ xmm,xmm/m128`	`F2 0F E6 /r`	Yes	Yes	Yes (W=1)	F	32	64	RC
Added with SSE3
Duplicate floating-point values from even-numbered lanes to next odd-numbered lanes up	32-bit	`(V)MOVSLDUP xmm,xmm/m128`	`F3 0F 12 /r`	Yes	Yes	Yes (W=0)	F	32	No	No
	64-bit	`(V)MOVDDUP xmm/xmm/m128`	`F2 0F 12 /r`	Yes	Yes	Yes (W=1)	F	64	No	No
Duplicate FP32 values from odd-numbered lanes to next even-numbered lanes down		`(V)MOVSHDUP xmm,xmm/m128`	`F3 0F 16 /r`	Yes	Yes	Yes (W=0)	F	32	No	No
Packed pairwise horizontal addition of floating-point values	32-bit	`(V)HADDPS xmm,xmm/m128`^[a]	`F2 0F 7C /r`	Yes	Yes	No	—	—	—	—
	64-bit	`(V)HADDPD xmm,xmm/m128`^[a]	`66 0F 7C /r`	Yes	Yes	No	—	—	—	—
Packed pairwise horizontal subtraction of floating-point values	32-bit	`(V)HSUBPS xmm,xmm/m128`^[a]	`F2 0F 7D /r`	Yes	Yes	No	—	—	—	—
	64-bit	`(V)HSUBPD xmm,xmm/m128`^[a]	`66 0F 7D /r`	Yes	Yes	No	—	—	—	—
Packed floating-point add/subtract in alternating lanes. Even-numbered lanes (counting from 0) do subtract, odd-numbered lanes do add.	32-bit	`(V)ADDSUBPS xmm,xmm/m128`	`F2 0F D0 /r`	Yes	Yes	No	—	—	—	—
	64-bit	`(V)ADDSUBPD xmm,xmm/m128`	`66 0F D0 /r`	Yes	Yes	No	—	—	—	—
Vector load from unaligned memory with looser semantics than `(V)MOVDQU`. Unlike `(V)MOVDQU`, it may fetch data more than once or, for a misaligned access, fetch additional data up until the next 16/32-byte alignment boundaries below/above the actually-requested data.		`(V)LDDQU xmm,m128`	`F2 0F F0 /r`	Yes	Yes	No	—	—	—	—
Added with SSE4.1
Vector logical test. Sets ZF=1 if bitwise-AND between first operand and second operand results in all-0s, ZF=0 otherwise. Sets CF=1 if bitwise-AND between second operand and bitwise-NOT of first operand results in all-0s, CF=0 otherwise		`(V)PTEST xmm,xmm/m128`	`66 0F38 17 /r`	Yes	Yes	No^[d]	—	—	—	—
Variable blend packed floating-point values. For each lane of the result, pick the value from either the first or the second argument depending on the top bit of the corresponding lane of `XMM0`.	32-bit	`BLENDVPS xmm,xmm/m128` `BLENDVPS xmm,xmm/m128,XMM0`^[e]	`66 0F38 14 /r`	Yes	No^[f]	No	—	—	—	—
	64-bit	`BLENDVPD xmm,xmm/m128` `BLENDVPD xmm,xmm/m128,XMM0`^[e]	`66 0F38 15 /r`	Yes	No^[f]	No	—	—	—	—
Rounding of packed floating-point values to integer. Rounding mode specified by imm8 argument.	32-bit	`(V)ROUNDPS xmm,xmm/m128,imm8`	`66 0F3A 08 /r ib`	Yes	Yes	No^[g]	—	—	—	—
	64-bit	`(V)ROUNDPD xmm,xmm/m128,imm8`	`66 0F3A 09 /r ib`	Yes	Yes	No^[g]	—	—	—	—
Rounding of scalar floating-point value to integer.	32-bit	`(V)ROUNDSS xmm,xmm/m128,imm8`	`66 0F3A 0A /r ib`	Yes	Yes	No^[g]	—	—	—	—
Rounding of scalar floating-point value to integer.	64-bit	`(V)ROUNDSD xmm,xmm/m128,imm8`	`66 0F3A 0B /r ib`	Yes	Yes	No^[g]	—	—	—	—
Blend packed floating-point values. For each lane of the result, pick the value from either the first or the second argument depending on the corresponding imm8 bit.	32-bit	`(V)BLENDPS xmm,xmm/m128,imm8`	`66 0F3A 0C /r ib`	Yes	Yes	No	—	—	—	—
	64-bit	`(V)BLENDPD xmm,xmm/m128,imm8`	`66 0F3A 0D /r ib`	Yes	Yes	No	—	—	—	—
Extract 32-bit lane of XMM register to general-purpose register or memory location. Bits[1:0] of imm8 is used to select lane.		`(V)EXTRACTPS r/m32,xmm,imm8`	`66 0F3A 17 /r ib`	Yes	Yes (L=0)	Yes (L=0)	F	No	No	No
Obtain 32-bit value from source XMM register or memory, and insert into the specified lane of destination XMM register. If the source argument is an XMM register, then bits[7:6] of the imm8 is used to select which 32-bit lane to select source from, otherwise the specified 32-bit memory value is used. This 32-bit value is then inserted into the destination register lane specified by bits[5:4] of the imm8. After insertion, each 32-bit lane of the destination register may optionally be zeroed out - bits[3:0] of the imm8 provides a bitmap of which lanes to zero out.		`(V)INSERTPS xmm,xmm/m32,imm8`	`66 0F3A 21 /r ib`	Yes	Yes (L=0)	Yes (L=0,W=0)	F	No	No	No
4-component dot-product of 32-bit floating-point values. Bits [7:4] of the imm8 specify which lanes should participate in the dot-product, bits[3:0] specify which lanes in the result should receive the dot-product (remaining lanes are filled with zeros)		`(V)DPPS xmm,xmm/m128,imm8`^[a]	`66 0F3A 40 /r ib`	Yes	Yes	No	—	—	—	—
2-component dot-product of 64-bit floating-point values. Bits [5:4] of the imm8 specify which lanes should participate in the dot-product, bits[1:0] specify which lanes in the result should receive the dot-product (remaining lanes are filled with zeros)		`(V)DPPD xmm,xmm/m128,imm8`^[a]	`66 0F3A 41 /r ib`	Yes	Yes	No	—	—	—	—
Added with SSE4a (AMD only)
64-bit bitfield insert, using the low 64 bits of XMM registers. First argument is an XMM register to insert bitfield into, second argument is a source register containing the bitfield to insert (starting from bit 0). For the 4-argument version, the first imm8 specifies bitfield length and the second imm8 specifies bit-offset to insert bitfield at. For the 2-argument version, the length and offset are instead taken from bits [69:64] and [77:72] of the second argument, respectively.		`INSERTQ xmm,xmm,imm8,imm8`	`F2 0F 78 /r ib ib`	Yes	No	No^[h]	—	—	—	—
		`INSERTQ xmm,xmm`	`F2 0F 79 /r`	Yes	No	No^[h]	—	—	—	—
64-bit bitfield extract, from the lower 64 bits of an XMM register. The first argument serves as both source that bitfield is extracted from and destination that bitfield is written to. For the 3-argument version, the first imm8 specifies bitfield length and the second imm8 specifies bitfield bit-offset. For the 2-argument version, the second argument is an XMM register that contains bitfield length at bits[5:0] and bit-offset at bits[13:8].		`EXTRQ xmm,imm8,imm8`	`66 0F 78 /0 ib ib`	Yes	No	No^[h]	—	—	—	—
		`EXTRQ xmm,xmm`	`66 0F 79 /r`	Yes	No	No^[h]	—	—	—	—

[a]
For the VPSHUFLW, VPSHUFHW, VHADDP*, VHSUBP*, VDPPS and VDPPD instructions, encodings with a vector-length wider than 128 bits are available under AVX2 and/or AVX-512, but the operation of such encodings is split into 128-bit lanes where each 128-bit lane internally performs the same operation as the 128-bit variant of the instruction.
[b]
Under AVX, the VPSHUFHW and VPSHUFLW instructions are only available in 128-bit forms - the 256-bit forms of these instructions require AVX2.
[c]
For the EVEX-encoded form of VCVTDQ2PD, EVEX embedded rounding controls are permitted but have no effect.
[d]
Opcode not EVEX-encodable. Performing a vector logical test under AVX-512 requires a sequence of at least 2 instructions, e.g. VPTESTMD followed by KORTESTW.
[e]
Assemblers may accept the BLENDVPS/BLENDVPD instructions with or without XMM0 as a third argument.
[f]
While AVX does provide VBLENDVPS/VPD instruction that are similar in function to BLENDVPS/VPD, they uses a different opcode and operand encoding - VEX.66.0F3A.W0 4A/4B /r /is4.
[g]
Opcode not available under AVX-512. Instead, AVX512F provides different opcodes - EVEX.66.0F3A (08..0B) /r ib - for its new VRNDSCALE* rounding instructions.
[h]
Under AVX-512, EVEX-encoding the INSERTQ/EXTRQ opcodes result in AVX-512 instructions completely unrelated to SSE4a, namely VCVT(T)P(S|D)2UQQ and VCVT(T)S(S|D)2USI.

Remove ads

AVX/AVX2 instructions, and AVX-512 extended variants thereof

Summarize

Perspective

This covers instructions/opcodes that are new to AVX and AVX2.

AVX and AVX2 also include extended VEX-encoded forms of a large number of MMX/SSE instructions - please see tables above.

Some of the AVX/AVX2 instructions also exist in extended EVEX-encoded forms under AVX-512 as well.

AVX instructions

More information Instruction description, Instruction mnemonics ...

Instruction description		Instruction mnemonics	Basic opcode (VEX)	AVX	AVX-512 (EVEX-encoded)
Instruction description		Instruction mnemonics	Basic opcode (VEX)	AVX	supported	subset	lane	bcst
Zero out upper bits of YMM/ZMM registers.^[a] Zeroes out all bits except bits 127:0 of ymm0 to ymm15.		`VZEROUPPER`	`VEX.NP.0F 77`^[b]	(L=0)	No	—	—	—
Zero out YMM/ZMM registers.^[a] Zeroes out registers ymm0 to ymm15.		`VZEROALL`	`VEX.NP.0F 77`^[b]	(L=1)	No	—	—	—

Broadcast floating-point data from memory or bottom of XMM-register to all lanes of XMM/YMM/ZMM-register.	32-bit	`VBROADCASTSS ymm,xmm/m32`^[c]	`VEX.66.0F38.W0 18 /r`	Yes	Yes	F	32	(32)^[d]
	64-bit	`VBROADCASTSD ymm,xmm/m64`^[c] `VBROADCASTF32X2 zmm,xmm/m64`(AVX-512)	`VEX.66.0F38 19 /r`	`VBROADCASTSD` (L=1^[e],W=0)	`VBROADCASTF32X2`(L≠0,W=0)	DQ	32	(64)^[d]
	64-bit		`VEX.66.0F38 19 /r`	`VBROADCASTSD` (L=1^[e],W=0)	`VBROADCASTSD`(L≠0,W=1)	F	64	(64)^[d]
	128-bit	`VBROADCASTF128 ymm,m128` `VBROADCASTF32X4 zmm,m128`(AVX-512) `VBROADCASTF64X2 zmm,m128`(AVX-512)	`VEX.66.0F38 1A /r`	`VBROADCASTF128` (L=1,W=0)	`VBROADCASTF32X4`(L≠0,W=0)	F	32	(128)^[d]
	128-bit		`VEX.66.0F38 1A /r`	`VBROADCASTF128` (L=1,W=0)	`VBROADCASTF64X2`(L≠0,W=1)	DQ	64	(128)^[d]

Extract 128-bit vector-lane of floating-point data from wider vector-register		`VEXTRACTF128 xmm/m128,ymm,imm8` `VEXTRACTF32X4 xmm/m128,zmm,imm8`(AVX-512) `VEXTRACTF64X2 xmm/m128,zmm,imm8`(AVX-512)	`VEX.66.0F3A 19 /r ib`	`VEXTRACTF128` (L=1,W=0)	`VEXTRACTF32X4`(L≠0,W=0)	F	32	No
			`VEX.66.0F3A 19 /r ib`	`VEXTRACTF128` (L=1,W=0)	`VEXTRACTF64X2`(L≠0,W=1)	DQ	64	No
Insert 128-bit vector of floating-point data into 128-bit lane of wider vector		`VINSERTF128 ymm,ymm,xmm/m128,imm8` `VINSERTF32X4 zmm,zmm,xmm/m128,imm8`(AVX-512) `VINSERTF64X2 zmm,zmm,xmm/m128,imm8`(AVX-512)	`VEX.66.0F3A 18 /r ib`	`VINSERTF128` (L=1,W=0)	`VINSERTF32X4`(L≠0,W=0)	F	32	No
			`VEX.66.0F3A 18 /r ib`	`VINSERTF128` (L=1,W=0)	`VINSERTF64X2`(L≠0,W=1)	DQ	64	No
Concatenate the two source vectors into a vector of four 128-bit components, then use imm8 to index into vector Bits [1:0] of imm8 picks element to use for low 128 bits of result Bits[3:2] of imm8 picks element to use for high 128 bits of result		`VPERM2F128 ymm,ymm,ymm/m256,imm8`	`VEX.66.0F3A.W0 06 /r /ib`	(L=1)	No	—	—	—
Perform shuffle of 32-bit sub-lanes within each 128-bit lane of vectors. Variable-shuffle form uses bits[1:0] of each lane for selection. imm8 form uses same shuffle in every 128-bit lane.		`VPERMILPS ymm,ymm,ymm/m256`	`VEX.66.0F38.W0 0C /r`	Yes	Yes	F	32	32
		`VPERMILPS ymm,ymm/m256,imm8`	`VEX.66.0F3A.W0 04 /r ib`	Yes	Yes	F	32	32
Perform shuffle of 64-bit sub-lanes within each 128-bit lane of vectors. Variable-shuffle form uses bit[1] of each lane for selection. imm8 form uses two bits of the imm8 for each of the 128-bit lanes.		`VPERMILPD ymm,ymm,ymm/m256`	`VEX.66.0F38.W0 0D /r`	Yes	Yes	F	64	64
		`VPERMILPD ymm,ymm/m256,imm8`	`VEX.66.0F3A.W0 05 /r ib`	Yes	Yes	F	64	64

Packed memory load/store of floating-point data with per-lane write masking. First argument is destination, third argument is source. The second argument provides masks, in the top bit of each 32-bit lane.	32-bit	`VMASKMOVPS ymm,ymm,m256`	`VEX.66.0F38.W0 2C /r`	Yes	No^[f]	—	—	—
	32-bit	`VMASKMOVPS m256,ymm,ymm`	`VEX.66.0F38.W0 2E /r`	Yes	No^[f]	—	—	—
	64-bit	`VMASKMOVPD ymm,ymm,m256`	`VEX.66.0F38.W0 2D /r`	Yes	No^[f]	—	—	—
	64-bit	`VMASKMOVPD m256,ymm,ymm`	`VEX.66.0F38.W0 2F /r`	Yes	No^[f]	—	—	—
Variable blend packed floating-point values. For each lane of the result, pick the value from either the second or the third argument depending on the top bit of the corresponding lane of the fourth argument.	32-bit	`VBLENDVPS ymm,ymm,ymm/m256,ymm`	`VEX.66.0F3A.W0 4A /r /is4`	Yes	No	—	—	—
	64-bit	`VBLENDVPD ymm,ymm,ymm/m256,ymm`	`VEX.66.0F3A.W0 4B /r /is4`	Yes	No	—	—	—
Variable blend packed bytes. For each byte lane of the result, pick the value from either the second or the third argument depending on the top bit of the corresponding byte lane of the fourth argument.		`VPBLENDVB xmm,xmm,xmm/m128,xmm`^[g]	`VEX.66.0F3A.W0 4C /r is4`	Yes	No	—	—	—

Vector logical sign-bit test on packed floating-point values. Sets ZF=1 if bitwise-AND between sign-bits of the first operand and second operand results in all-0s, ZF=0 otherwise. Sets CF=1 if bitwise-AND between sign-bits of second operand and bitwise-NOT of first operand results in all-0s, CF=0 otherwise.	32-bit	`VTESTPS ymm,ymm/m256`	`VEX.66.0F38.W0 0E /r`	Yes	No	—	—	—
	64-bit	`VTESTPD ymm,ymm/m256`	`VEX.66.0F38.W0 0F /r`	Yes	No	—	—	—

[a]
For code that may potentially mix use of legacy-SSE instructions with 256-bit AVX instructions, it is strongly recommended to execute a VZEROUPPER or VZEROALL instruction after executing AVX instructions but before executing SSE instructions. If this is not done, any subsequent legacy-SSE code may be subject to severe performance degradation.^[18]
[b]
While the VZEROUPPER and VZEROALL instructions are architecturally listed as ignoring the VEX.W bit, some early AVX implementations (e.g. Sandy Bridge^[19]) will #UD if the VZEROUPPER and VZEROALL instructions are encoded with VEX.W=1. For this reason, it is recommended to encode these instructions with VEX.W=0.
[c]
VBROADCASTSS and VBROADCASTSD with a register source operand are not supported under AVX - support for xmm-register source operands for these instructions was added in AVX2.
[d]
The V(P)BROADCAST* instructions perform broadcast as part of their normal operation - under AVX-512 with EVEX prefix, they do not require or accept the EVEX.b modifier.
[e]
The VBROADCASTSD instruction does not support broadcast of 64-bit data into a 128-bit vector. For broadcast of 64-bit data into a 128-bit vector, the SSE3 (V)MOVDDUP instruction or the AVX2 VPBROADCASTQ instruction may be used.
[f]
Under AVX-512, EVEX-encoded forms of the VMASKMOVP(S|D) instructions are not available. For masked moves of FP32/FP64 values to/from memory under AVX-512, the VMOVUPS and VMOVUPD may be used with an opmask register.
[g]
Under AVX, the VPBLENDVB instruction is only available with a 128-bit vector width (VEX.L=0). Support for 256-bit vector width was added in AVX2.

AVX2 instructions

More information Instruction description, Instruction mnemonics ...

Instruction description			Instruction mnemonics	Basic opcode (VEX)	AVX2	AVX-512 (EVEX-encoded)
Instruction description			Instruction mnemonics	Basic opcode (VEX)	AVX2	supported	subset	lane	bcst
Broadcast integer data from memory or bottom lane of XMM-register to all lanes of XMM/YMM/ZMM register		8-bit	`VPBROADCASTB ymm,xmm/m8`	`VEX.66.0F38.W0 78 /r`	Yes	Yes^[a]	BW	8	(8)^[b]
		16-bit	`VPBROADCASTW ymm,xmm/m16`	`VEX.66.0F38.W0 79 /r`	Yes	Yes^[a]	BW	16	(16)^[b]
		32-bit	`VPBROADCASTD ymm,xmm/m32`	`VEX.66.0F38.W0 58 /r`	Yes	Yes^[a]	F	32	(32)^[b]
		64-bit	`VPBROADCASTQ ymm,xmm/m64` `VBROADCASTI32X2 zmm,xmm/m64`(AVX-512)	`VEX.66.0F38 59 /r`	`VPBROADCASTQ` (W=0)	`VBROADCASTI32X2`(W=0)	DQ	32	(64)^[b]
		64-bit		`VEX.66.0F38 59 /r`	`VPBROADCASTQ` (W=0)	`VPBROADCASTQ`(W=1)^[a]	F	64	(64)^[b]
		128-bit	`VBROADCASTI128 ymm,m128` `VBROADCASTI32X4 zmm,m128`(AVX-512) `VBROADCASTI64X2 zmm,m128`(AVX-512)	`VEX.66.0F38 5A /r`	`VBROADCASTI128` (L=1,W=0)	`VBROADCASTI32X4`(L≠0,W=0)	F	32	(128)^[b]
		128-bit		`VEX.66.0F38 5A /r`	`VBROADCASTI128` (L=1,W=0)	`VBROADCASTI64X2`(L≠0,W=1)	DQ	64	(128)^[b]

Extract 128-bit vector-lane of integer data from wider vector-register			`VEXTRACTI128 xmm/m128,ymm,imm8` `VEXTRACTI32X4 xmm/m128,zmm,imm8`(AVX-512) `VEXTRACTI64X2 xmm/m128,zmm,imm8`(AVX-512)	`VEX.66.0F3A 39 /r ib`	`VEXTRACTI128` (L=1,W=0)	`VEXTRACTI32X4`(L≠0,W=0)	F	32	No
				`VEX.66.0F3A 39 /r ib`	`VEXTRACTI128` (L=1,W=0)	`VEXTRACTI64X2`(L≠0,W=1)	DQ	64	No
Insert 128-bit vector of integer data into lane of wider vector			`VINSERTI128 ymm,ymm,xmm/m128,imm8` `VINSERTI32X4 ymm,ymm,xmm/m128,imm8`(AVX-512) `VINSERTI64X2 ymm,ymm,xmm/m128,imm8`(AVX-512)	`VEX.66.0F3A 38 /r ib`	`VINSERTI128` (L=1,W=0)	`VINSERTI32X4`(L≠0,W=0)	F	32	No
				`VEX.66.0F3A 38 /r ib`	`VINSERTI128` (L=1,W=0)	`VINSERTI64X2`(L≠0,W=1)	DQ	64	No
Concatenate the two source vectors into a vector of four 128-bit components, then use imm8 to index into vector Bits [1:0] of imm8 picks element to use for low 128 bits of result Bits[3:2] of imm8 picks element to use for high 128 bits of result			`VPERM2I128 ymm,ymm,ymm/m256,imm8`	`VEX.66.0F3A.W0 46 /r ib`	(L=1)	No	—	—	—
Perform shuffle of FP64 values in vector			`VPERMPD ymm,ymm/m256,imm8`	`VEX.66.0F3A.W1 01 /r ib`	(L=1)^[c]	Yes (L≠0)^[d]	F	64	64
Perform shuffle of 64-bit integers in vector			`VPERMQ ymm,ymm/m256,imm8`	`VEX.66.0F3A.W1 00 /r ib`	(L=1)^[c]	Yes (L≠0)^[d]	F	64	64
Perform variable shuffle of FP32 values in vector			`VPERMPS ymm,ymm,ymm/m256`	`VEX.66.0F38.W0 16 /r`	(L=1)^[c]	Yes (L≠0)	F	32	32
Perform variable shuffle of 32-bit integers in vector			`VPERMD ymm,ymm,ymm/m256`	`VEX.66.0F38.W0 36 /r`	(L=1)^[c]	Yes (L≠0)	F	32	32

Packed memory load/store of integer data with per-lane write masking. First argument is destination, third argument is source. The second argument provides masks, in the top bit of each lane.		32-bit	`VPMASKMOVD ymm,ymm,m256`	`VEX.66.0F38.W0 8C /r`	Yes	No	—	—	—
		32-bit	`VPMASKMOVD m256,ymm,ymm`	`VEX.66.0F38.W0 8E /r`	Yes	No	—	—	—
		64-bit	`VPMASKMOVQ ymm,ymm,m256`	`VEX.66.0F38.W1 8C /r`	Yes	No	—	—	—
		64-bit	`VPMASKMOVQ m256,ymm,ymm`	`VEX.66.0F38.W1 8E /r`	Yes	No	—	—	—
Blend packed 32-bit integer values. For each 32-bit lane of result, pick value from second or third argument depending on the corresponding bit in the imm8 argument.			`VPBLENDD ymm,ymm,ymm/m256,imm8`	`VEX.66.0F3A.W0 02 /r ib`	Yes	No	—	—	—

Left-shift packed integers, with per-lane shift-amount		32-bit	`VPSLLVD ymm,ymm,xmm/m256`	`VEX.66.0F38.W0 47 /r`	Yes	Yes	F	32	32
Left-shift packed integers, with per-lane shift-amount		64-bit	`VPSLLVQ ymm,ymm,xmm/m256`	`VEX.66.0F38.W1 47 /r`	Yes	Yes	F	32	64
Right-shift packed signed integers, with per-lane shift-amount		32-bit	`VPSRAVD ymm,ymm,ymm/m256`	`VEX.66.0F38 46 /r`	`VPSRAVD` (W=0)	`VPSRAVD`(W=0)	F	32	32
		64-bit	`VPSRAVQ zmm,zmm,zmm/m512`(AVX-512)	`VEX.66.0F38 46 /r`	`VPSRAVD` (W=0)	`VPSRAVQ`(W=1)	F	64	64
Right-shift packed unsigned integers, with per-lane shift-amount		32-bit	`VPSRLVD ymm,ymm,ymm/m256`	`VEX.66.0F38.W0 45 /r`	Yes	Yes	F	32	32
		64-bit	`VPSRLVQ ymm,ymm,ymm/m256`	`VEX.66.0F38.W5 45 /r`	Yes	Yes	F	64	64

Conditional vector memory gather. For each 32/64-bit component of a given input vector register, treat the component as an index for an x86 SIB `base+scale*index+displacement` address calculation, then load a 32/64-bit data item from the computed memory address. The third argument to the instruction is a mask argument - for each destination vector lane, a memory load is only performed if the MSB of the corresponding mask-argument lane is set to 1. For each load, the corresponding mask-argument lane is zeroed out.^[e]	s32→i32		`VPGATHERDD ymm1,vm32y,ymm2`	`VEX.66.0F38.W0 90 /r /vsib`	Yes	Yes^[e]	F	32	No
	s32→i64		`VPGATHERDQ ymm1,vm32x,ymm2`	`VEX.66.0F38.W1 90 /r /vsib`	Yes	Yes^[e]	F	64	No
	s64→i32		`VPGATHERQD xmm1,vm64y,xmm2`	`VEX.66.0F38.W0 91 /r /vsib`	Yes	Yes^[e]	F	32	No
	s64→i64		`VPGATHERQQ ymm1,vm64y,ymm2`	`VEX.66.0F38.W1 91 /r /vsib`	Yes	Yes^[e]	F	64	No
	s32→fp32		`VGATHERDPS ymm1,vm32y,ymm2`	`VEX.66.0F38.W0 92 /r /vsib`	Yes	Yes^[e]	F	32	No
	s32→fp64		`VGATHERDPD ymm1,vm32x,ymm2`	`VEX.66.0F38.W1 92 /r /vsib`	Yes	Yes^[e]	F	64	No
	s64→fp32		`VGATHERQPS ymm1,vm64y,ymm2`	`VEX.66.0F38.W0 93 /r /vsib`	Yes	Yes^[e]	F	32	No
	s64→fp64		`VGATHERQPD ymm1,vm64x,ymm2`	`VEX.66.0F38.W1 93 /r /vsib`	Yes	Yes^[e]	F	64	No

[a]
For AVX-512, variants of the VPBROADCAST(B/W/D/Q) instructions that can use a general-purpose register as source exist as well, with opcodes EVEX.66.0F38.W0 (7A..7C)
[b]
The V(P)BROADCAST* instructions perform broadcast as part of their normal operation - under AVX-512 with EVEX prefix, they do not require or accept the EVEX.b modifier.
[c]
For VPERMPS, VPERMPD, VPERMD and VPERMQ, minimum supported vector width is 256 bits. For shuffles in a 128-bit vector, use VPERMILPS or VPERMILPD.
[d]
Under AVX-512, executing the VPERMPD and VPERMQ instructions with a vector width of 512 bits will cause the operation to be split into two 256-bit halves, with the imm8 swizzle being applied to each half separately.
Under AVX-512, variable-shuffle variants of the VPERMPD and VPERMQ instructions exist with opcodes EVEX.66.0F38.W1 16 /r and EVEX.66.0F38.W1 36 /r, respectively - these variants do not split their operation into 256-bit halves.
[e]
For EVEX-encoded forms of the V(P)GATHER* instructions under AVX-512, lane-masking is done with an opmask register instead of an XMM/YMM/ZMM vector register.

Other VEX-encoded SIMD instructions

SIMD instructions set extensions that are using the VEX prefix, and are not considered part of baseline AVX/AVX2/AVX-512, FMA3/4 or AMX.

Integer, opmask and cryptographic instructions that use the VEX prefix (e.g. the BMI2, CMPccXADD, VAES and SHA512 extensions) are not included.

More information Instruction set extension, Instruction description ...

Instruction set extension	Instruction description		Instruction mnemonics	Basic opcode (VEX)	AVX-512 (EVEX-encoded)					Added in
Instruction set extension	Instruction description		Instruction mnemonics	Basic opcode (VEX)	supp.	subset	lane	bcst	rc/sae	Added in
F16C Packed conversions between FP16 and FP32	Packed conversion from FP16 to FP32.		`VCVTPH2PS ymm1,xmm2/m128`	`VEX.66.0F38.W0 13 /r`	Yes	F	32	16	SAE	Ivy Bridge, Piledriver, Jaguar, Nano QuadCore C4000, ZhangJiang
F16C Packed conversions between FP16 and FP32	Packed conversion from FP32 to FP16. Imm8 argument provides rounding controls.^[a]		`VCVTPS2PH xmm1,ymm2/m256,imm8`	`VEX.66.0F3A.W0 1D /r ib`	Yes	F	16	No	SAE

AVX-VNNI Vector Neural Network Instructions	For each 32-bit lane, compute an integer dot-product of 8-bit components from the two source arguments (first unsigned, second signed), then add that dot-product result to an accumulator.	no saturation	`VPDPBUSD ymm1,ymm2,ymm3/m256`	`VEX.66.0F38.W0 50 /r`	Yes	VNNI	32	32	No	AVX512_VNNI: Cascade Lake, Zen 4 AVX-VNNI: Alder Lake, Sapphire Rapids, Zen 5
		signed saturation	`VPDPBUSDS ymm1,ymm2,ymm3/m256`	`VEX.66.0F38.W0 51 /r`	Yes	VNNI	32	32	No
	For each 32-bit lane, compute an integer dot-product of 16-bit components from the two source arguments (both signed), then add the dot-product result to an accumulator.	no saturation	`VPDPWSSD ymm1,ymm2,ymm3/m256`	`VEX.66.0F38.W0 52 /r`	Yes	VNNI	32	32	No
		signed saturation	`VPDPWSSDS ymm1,ymm2,ymm3/m256`	`VEX.66.0F38.W0 53 /r`	Yes	VNNI	32	32	No

AVX-IFMA Integer Fused Multiply Add	For each 64-bit lane, perform an unsigned multiply of the bottom 52 bits of each of the two source arguments, then extract either the low half or the high half of the 104-bit product as an unsigned 52-bit integer that is then added to the corresponding 64-bit lane in the destination register.	low half	`VPMADD52LUQ ymm1,ymm2,ymm3/m256`	`VEX.66.0F38.W1 B4 /r`	Yes	IFMA	64	64	No	AVX512_IFMA: Cannon Lake, Ice Lake, Zen 4 AVX-IFMA: Lunar Lake, Arrow Lake
AVX-IFMA Integer Fused Multiply Add		high half	`VPMADD52HUQ ymm1,ymm2,ymm3/m256`	`VEX.66.0F38.W1 B5 /r`	Yes	IFMA	64	64	No

AVX-NE-CONVERT No-exception FP16/BF16 conversion instructions	Convert packed FP32 to packed BF16 with round-to-nearest-even		`VCVTNEPS2BF16 xmm1,ymm2/m256`	`VEX.F3.0F38.W0 72 /r`	Yes	BF16^[b]	16	32	No	AVX512_BF16: Cooper Lake, Zen 4, Sapphire Rapids AVX-NE-CONVERT: Lunar Lake, Arrow Lake
	Load a vector of packed FP16 or BF16 values from memory, then convert all the even or odd elements in that vector (depending on instruction) to packed FP32 values.	BF16, even	`VCVTNEEBF162PS ymm,m256`	`VEX.F3.0F38.W0 B0 /r`	No	—	—	—	—
		FP16, even	`VCVTNEEPH2PS ymm,m256`	`VEX.66.0F38.W0 B0 /r`	No	—	—	—	—
		BF16, odd	`VCVTNEOBF162PS ymm,m256`	`VEX.F2.0F38.W0 B0 /r`	No	—	—	—	—
		FP16, odd	`VCVTNEOPH2PS ymm,m256`	`VEX.NP.0F38.W0 B0 /r`	No	—	—	—	—
	Load scalar FP16 or BF16 value from memory, convert to FP32, then broadcast to destination vector register.	BF16	`VBCSTNEBF162PS ymm,m16`	`VEX.F3.0F38.W0 B1 /r`	No	—	—	—	—
		FP16	`VBCSTNESH2PS ymm,m16`	`VEX.66.0F38.W0 B1 /r`	No	—	—	—	—

AVX-VNNI-INT8	For each 32-bit lane, compute an integer dot-product of four 8-bit components from the two source arguments, then add the dot-product result to an accumulator. Each of the two source arguments may have ther components treated as either signed or unsigned; the addition to the accumulator may be done with or without saturation (signed or unsigned) depending on instruction.	s8*s8	`VPDPBSSD ymm1,ymm2,ymm3/m256`	`VEX.F2.0F38.W0 50 /r`	No	—	—	—	—	Lunar Lake, Arrow Lake
		s8*s8, ssat	`VPDPBSSDS ymm1,ymm2,ymm3/m256`	`VEX.F2.0F38.W0 51 /r`	No	—	—	—	—
		s8*u8	`VPDPBSUD ymm1,ymm2,ymm3/m256`	`VEX.F3.0F38.W0 50 /r`	No	—	—	—	—
		s8*u8, ssat	`VPDPBSUDS ymm1,ymm2,ymm3/m256`	`VEX.F3.0F38.W0 50 /r`	No	—	—	—	—
		u8*u8	`VPDPBUUD ymm1,ymm2,ymm3/m256`	`VEX.NP.0F38.W0 50 /r`	No	—	—	—	—
		u8*u8, usat	`VPDPBUUDS ymm1,ymm2,ymm3/m256`	`VEX.NP.0F38.W0 50 /r`	No	—	—	—	—

AVX-VNNI-INT16	For each 32-bit lane, compute an integer dot-product of two 16-bit components from the two source arguments, then add the dot-product result to an accumulator. Each of the two source arguments may have their components treated as either signed or unsigned; the addition to the accumulator may be done with or without saturation (signed or unsigned) depending on instruction.	s16*u16	`VPDPWSUD ymm1,ymm2,ymm3/m256`	`VEX.F3.0F38.W0 D2 /r`	No	—	—	—	—	Lunar Lake, Arrow Lake-S
		s16*u16, ssat	`VPDPWSUDS ymm1,ymm2,ymm3/m256`	`VEX.F3.0F38.W0 D3 /r`	No	—	—	—	—
		u16*s16	`VPDPWUSD ymm1,ymm2,ymm3/m256`	`VEX.66.0F38.W0 D2 /r`	No	—	—	—	—
		u16*s16, ssat	`VPDPWUSDS ymm1,ymm2,ymm3/m256`	`VEX.66.0F38.W0 D3 /r`	No	—	—	—	—
		u16*u16	`VPDPWUUD ymm1,ymm2,ymm3/m256`	`VEX.NP.0F38.W0 D2 /r`	No	—	—	—	—
		u16*u16, usat	`VPDPWUUDS ymm1,ymm2,ymm3/m256`	`VEX.NP.0F38.W0 D3 /r`	No	—	—	—	—

[a]
For the VCVTPS2PH instruction, if bit 2 if the imm8 argument is set, then the rounding mode to use is taken from the MXCSR, else the rounding mode is taken from bits 1:0 of the imm8 (the top 5 bits of the imm8 are ignored). The supported rounding modes are:
More information Value, Rounding mode ...

Value Rounding mode

0 Round to nearest even

1 Round down

2 Round up

3 Round to zero

Close
[b]
The VCVTNEPS2BF16 is the only AVX512_BF16 instruction for which the AVX-NE-CONVERT extension provides a VEX-encoded form. The other AVX512_BF16 instructions (none of which have any VEX-encoded forms) are not listed here.

Value	Rounding mode
0	Round to nearest even
1	Round down
2	Round up
3	Round to zero

Remove ads

FMA3 and FMA4 instructions

Summarize

Perspective

Floating-point fused multiply-add instructions are introduced in x86 as two instruction set extensions, "FMA3" and "FMA4", both of which build on top of AVX to provide a set of scalar/vector instructions using the xmm/ymm/zmm vector registers. FMA3 defines a set of 3-operand fused-multiply-add instructions that take three input operands and writes its result back to the first of them. FMA4 defines a set of 4-operand fused-multiply-add instructions that take four input operands – a destination operand and three source operands.

FMA3 is supported on Intel CPUs starting with Haswell, on AMD CPUs starting with Piledriver, and on Zhaoxin CPUs starting with YongFeng. FMA4 was only supported on AMD Family 15h (Bulldozer) CPUs and has been abandoned from AMD Zen onwards. The FMA3/FMA4 extensions are not considered to be an intrinsic part of AVX or AVX2, although all Intel and AMD (but not Zhaoxin) processors that support AVX2 also support FMA3. FMA3 instructions (in EVEX-encoded form) are, however, AVX-512 foundation instructions.
The FMA3 and FMA4 instruction sets both define a set of 10 fused-multiply-add operations, all available in FP32 and FP64 variants. For each of these variants, FMA3 defines three operand orderings while FMA4 defines two.
FMA3 encoding
FMA3 instructions are encoded with the VEX or EVEX prefixes – on the form VEX.66.0F38 xy /r or EVEX.66.0F38 xy /r. The VEX.W/EVEX.W bit selects floating-point format (W=0 means FP32, W=1 means FP64). The opcode byte xy consists of two nibbles, where the top nibble x selects operand ordering (9='132', A='213', B='231') and the bottom nibble y (values 6..F) selects which one of the 10 fused-multiply-add operations to perform. (x and y outside the given ranges will result in something that is not an FMA3 instruction.)
At the assembly language level, the operand ordering is specified in the mnemonic of the instruction:

vfmadd132sd xmm1,xmm2,xmm3 will perform xmm1 ← (xmm1*xmm3)+xmm2
vfmadd213sd xmm1,xmm2,xmm3 will perform xmm1 ← (xmm2*xmm1)+xmm3
vfmadd231sd xmm1,xmm2,xmm3 will perform xmm1 ← (xmm2*xmm3)+xmm1

For all FMA3 variants, the first two arguments must be xmm/ymm/zmm vector register arguments, while the last argument may be either a vector register or memory argument. Under AVX-512 and AVX10, the EVEX-encoded variants support EVEX-prefix-encoded broadcast, opmasks and rounding-controls.
The AVX512-FP16 extension, introduced in Sapphire Rapids, adds FP16 variants of the FMA3 instructions – these all take the form EVEX.66.MAP6.W0 xy /r with the opcode byte working in the same way as for the FP32/FP64 variants. The AVX10.2 extension, published in 2024,^[20] similarly adds BF16 variants of the packed (but not scalar) FMA3 instructions – these all take the form EVEX.NP.MAP6.W0 xy /r with the opcode byte again working similar to the FP32/FP64 variants. (For the FMA4 instructions, no FP16 or BF16 variants are defined.)
FMA4 encoding
FMA4 instructions are encoded with the VEX prefix, on the form VEX.66.0F3A xx /r ib (no EVEX encodings are defined). The opcode byte xx uses its bottom bit to select floating-point format (0=FP32, 1=FP64) and the remaining bits to select one of the 10 fused-multiply-add operations to perform.

For FMA4, operand ordering is controlled by the VEX.W bit. If VEX.W=0, then the third operand is the r/m operand specified by the instruction's ModR/M byte and the fourth operand is a register operand, specified by bits 7:4 of the ib (8-bit immediate) part of the instruction. If VEX.W=1, then these two operands are swapped. For example:

vfmaddsd xmm1,xmm2,[mem],xmm3 will perform xmm1 ← (xmm2*[mem])+xmm3 and require a W=0 encoding.
vfmaddsd xmm1,xmm2,xmm3,[mem] will perform xmm1 ← (xmm2*xmm3)+[mem] and require a W=1 encoding.
vfmaddsd xmm1,xmm2,xmm3,xmm4 will perform xmm1 ← (xmm2*xmm3)+xmm4 and can be encoded with either W=0 or W=1.

Opcode table
The 10 fused-multiply-add operations and the 122 instruction variants they give rise to are given by the following table – with FMA4 instructions highlighted with * and yellow cell coloring, and FMA3 instructions not highlighted:

More information Basic operation, Opcode byte ...

Basic operation	Opcode byte	FP32 instructions	FP64 instructions	FP16 instructions (AVX512-FP16)	BF16 instructions (AVX10.2)
Packed alternating multiply-add/subtract (AB)-C* in even-numbered lanes^[a] (AB)+C* in odd-numbered lanes	`96`	`VFMADDSUB132PS`	`VFMADDSUB132PD`	`VFMADDSUB132PH`	—
	`A6`	`VFMADDSUB213PS`	`VFMADDSUB213PD`	`VFMADDSUB213PH`	—
	`B6`	`VFMADDSUB231PS`	`VFMADDSUB231PD`	`VFMADDSUB231PH`	—
	`5C/5D`*	`VFMADDSUBPS`*	`VFMADDSUBPD`*	—	—
Packed alternating multiply-subtract/add (AB)+C* in even-numbered lanes (AB)-C* in odd-numbered lanes	`97`	`VFMSUBADD132PS`	`VFMSUBADD132PD`	`VFMSUBADD132PH`	—
	`A7`	`VFMSUBADD213PS`	`VFMSUBADD213PD`	`VFMSUBADD213PH`	—
	`B7`	`VFMSUBADD231PS`	`VFMSUBADD231PD`	`VFMSUBADD231PH`	—
	`5E/5F`*	`VFMSUBADDPS`*	`VFMSUBADDPD`*	—	—
Packed multiply-add (AB)+C*	`98`	`VFMADD132PS`	`VFMADD132PD`	`VFMADD132PH`	`VFMADD132BF16`
	`A8`	`VFMADD213PS`	`VFMADD213PD`	`VFMADD213PH`	`VFMADD213BF16`
	`B8`	`VFMADD231PS`	`VFMADD231PD`	`VFMADD231PH`	`VFMADD231BF16`
	`68/69`*	`VFMADDPS`*	`VFMADDPD`*	—	—
Scalar multiply-add (AB)+C*	`99`	`VFMADD132SS`	`VFMADD132SD`	`VFMADD132SH`	—
	`A9`	`VFMADD213SS`	`VFMADD213SD`	`VFMADD213SH`	—
	`B9`	`VFMADD231SS`	`VFMADD231SD`	`VFMADD231SH`	—
	`6A/6B`*	`VFMADDSS`*	`VFMADDSD`*	—	—
Packed multiply-subtract (AB)-C*	`9A`	`VFMSUB132PS`	`VFMSUB132PD`	`VFMSUB132PH`	`VFMSUB132BF16`
	`AA`	`VFMSUB213PS`	`VFMSUB213PD`	`VFMSUB213PH`	`VFMSUB213BF16`
	`BA`	`VFMSUB231PS`	`VFMSUB231PD`	`VFMSUB231PH`	`VFMSUB231BF16`
	`6C/6D`*	`VFMSUBPS`*	`VFMSUBPD`*	—	—
Scalar multiply-subtract (AB)-C*	`9B`	`VFMSUB132SS`	`VFMSUB132SD`	`VFMSUB132SH`	—
	`AB`	`VFMSUB213SS`	`VFMSUB213SD`	`VFMSUB213SH`	—
	`BB`	`VFMSUB231SS`	`VFMSUB231SD`	`VFMSUB231SH`	—
	`6E/6F`*	`VFMSUBSS`*	`VFMSUBSD`*	—	—
Packed negative-multiply-add (-AB)+C*	`9C`	`VFNMADD132PS`	`VFNMADD132PD`	`VFNMADD132PH`	`VFNMADD132BF16`
	`AC`	`VFNMADD213PS`	`VFNMADD213PD`	`VFNMADD213PH`	`VFNMADD213BF16`
	`BC`	`VFNMADD231PS`	`VFNMADD231PD`	`VFNMADD231PH`	`VFNMADD231BF16`
	`78/79`*	`VFMADDPS`*	`VFMADDPD`*	—	—
Scalar negative-multiply-add (-AB)+C*	`9D`	`VFMADD132SS`	`VFMADD132SD`	`VFMADD132SH`	—
	`AD`	`VFMADD213SS`	`VFMADD213SD`	`VFMADD213SH`	—
	`BD`	`VFMADD231SS`	`VFMADD231SD`	`VFMADD231SH`	—
	`7A/7B`*	`VFMADDSS`*	`VFMADDSD`*	—	—
Packed negative-multiply-subtract (-AB)-C*	`9E`	`VFNMSUB132PS`	`VFNMSUB132PD`	`VFNMSUB132PH`	`VFNMSUB132BF16`
	`AE`	`VFNMSUB213PS`	`VFNMSUB213PD`	`VFNMSUB213PH`	`VFNMSUB213BF16`
	`BE`	`VFNMSUB231PS`	`VFNMSUB231PD`	`VFNMSUB231PH`	`VFNMSUB231BF16`
	`7C/7D`*	`VFNMSUBPS`*	`VFNMSUBPD`*	—	—
Scalar negative-multiply-subtract (-AB)-C*	`9F`	`VFNMSUB132SS`	`VFNMSUB132SD`	`VFNMSUB132SH`	—
	`AF`	`VFNMSUB213SS`	`VFNMSUB213SD`	`VFNMSUB213SH`	—
	`BF`	`VFNMSUB231SS`	`VFNMSUB231SD`	`VFNMSUB231SH`	—
	`7E/7F`*	`VFNMSUBSS`*	`VFNMSUBSD`*	—	—

[a]
Vector register lanes are counted from 0 upwards in a little-endian manner – the lane that contains the first byte of the vector is considered to be even-numbered.

Remove ads

AVX-512

Summarize

Perspective

AVX-512, introduced in 2014, adds 512-bit wide vector registers (extending the 256-bit registers, which become the new registers' lower halves) and doubles their count to 32; the new registers are thus named zmm0 through zmm31. It adds eight mask registers, named k0 through k7, which may be used to restrict operations to specific parts of a vector register. Unlike previous instruction set extensions, AVX-512 is implemented in several groups; only the foundation ("AVX-512F") extension is mandatory.^[21] Most of the added instructions may also be used with the 256- and 128-bit registers.

AVX-512 foundation, byte/word and doubleword/quadword instructions (F, BW and DQ subsets)

This covers instructions that are new to AVX-512's F, BW and DQ subsets.

These AVX-512 subsets also include extended EVEX-encoded forms of a large number of MMX/SSE/AVX instructions - please see tables above.

Regularly-encoded floating-point instructions

These instructions all follow a given pattern where:

EVEX.W is used to specify floating-point format (0=FP32, 1=FP64)
The bottom opcode bit is used to select between packed and scalar operation (0: packed, 1:scalar)
For a given operation, all the scalar/packed variants belong to the same AVX-512 subset.
The instructions all support result masking by opmask registers. They also all support broadcast of memory operands for packed variants.
If AVX512VL is supported, then all vector widths (128-bit, 256-bit and 512-bit) are supported for packed variants.

More information

...

Instruction description	AVX-512 subset	Basic opcode	FP32 instructions (EVEX.W=0)		FP64 instructions (EVEX.W=1)		RC/SAE
Instruction description	AVX-512 subset	Basic opcode	Packed	Scalar	Packed	Scalar	RC/SAE
Power-of-2 scaling of floating-point values. The operation of the `VSCALEF* dst,src1,src2` instructions is: ${\text{dst}}\leftarrow {\text{src1}}*2^{{\text{floor}}\left({\text{src2}}\right)}$	F	`EVEX.66.0F38 (2C/2D) /r`	`VSCALEFPS z,z,z/m512`	`VSCALEFSS x,x,x/m32`	`VSCALEFPD z,z,z/m512`	`VSCALEFSD x,x,x/m32`	SAE
Convert exponent of floating-point value to floating-point. The operation of the `VGETEXP* dst,src1` instructions is: ${\text{dst}}\leftarrow {\text{floor}}\left(\log _{2}\left\|{\text{src1}}\right\|\right)$ ^[a]	F	`EVEX.66.0F38 (42/43) /r`	`VGETEXPPS z,z/m512`	`VGETEXPSS x,x,x/m32`	`VGETEXPPD z,z/m512`	`VGETEXPSD x,x,x/m32`	SAE
Reciprocal approximation with an accuracy of $2^{-14}$ ^[b]	F	`EVEX.66.0F38 (4C/4D) /r`	`VRCP14PS z,z/m512`	`VRCP14SS x,x,x/m32`	`VRCP14PD z,z/m512`	`VRCP14SD x,x,x/m64`
Reciprocal square root approximation with an accuracy of $2^{-14}$ ^[b]	F	`EVEX.66.0F38 (4E/4F) /r`	`VRSQRT14PS z,z/m512`	`VRSQRT14SS x,x,x/m32`	`VRSQRT14PD z,z/m512`	`VRSQRT14SD x,x,x/m64`
Extract normalized mantissa from floating-point value. Scales the input number by a power of 2 such that the result ends up with its absolute-value within a specific normalization interval. Bits[1:0] of imm8 argument specifies normalization interval^[c] Bits[3:2] of imm8 argument specifies sign control^[d]	F	`EVEX.66.0F3A (26/27) /r ib`	`VGETMANTPS z,z/m512,imm8`	`VGETMANTSS x,x,x/m32,imm8`	`VGETMANTPD z,z/m512,imm8`	`VGETMANTSD x,x,x/m64,imm8`	SAE
Fix up special floating-point values. The first source argument provides floating-point values to perform fix-up on. Fix-up is done by first classifying the floating-point value into one of eight classes, resulting in an index in the range 0..7.^[e] The bottom 32 bits of the second argument is treated as a lookup table with 8 entries, each 4 bits wide; the classification index is used to look up a 4-bit entry from this lookup table. This 4-bit entry is then used to pick one of 16 response values for the floating-point value from the first argument.^[f] The result is then stored in destination register. (For the packed variants, the second source argument is interpreted as providing one lookup-table for each vector-lane; classification, table-lookup and response-value selection are done independently for each vector lane.) The imm8 argument is used to indicate floating-point exceptions that should be signalled for certain special values.^[g]	F	`EVEX.66.0F3A (54/55) /r ib`	`VFIXUPIMMPS z,z,z/m512,imm8`	`VFIXUPIMMSS x,x,x/m32,imm8`	`VFIXUPIMMPD z,z,z/m512,imm8`	`VFIXUPIMMSD x,x,x/m64,imm8`	SAE

Range Restriction Calculation. First compare the two source arguments and select one of them as specified by bits 1:0 of the imm8 argument,^[h] then modify the sign-bit of the selected source argument as specified by bits 3:2 of the imm8 argument.^[i]	DQ	`EVEX.66.0F3A (50/51) /r ib`	`VRANGEPS x,x,x/m128,imm8`	`VRANGESS x,x,x/m32,imm8`	`VRANGEPD x,x,x/m128,imm8`	`VRANGESD x,x,x/m64,imm8`	SAE
Reduction Transformation. Performs the operation ${\text{dst}}\leftarrow {\text{src1}}-\left({\text{ROUND}}\left(2^{M}src1\right)2^{-M}\right)$ ^[j] where $M$ is an unsigned 4-bit integer taken from bits 7:4 of the instruction's imm8 argument, and the rounding mode used for ${\text{ROUND}}()$ is taken from bits 2:0 of the imm8. (Bit 3 of the imm8 is used to suppress precision exceptions.)	DQ	`EVEX.66.0F3A (56/57) /r ib`	`VREDUCEPS x,x/m128,imm8`	`VREDUCESS x,x,x/m32,imm8`	`VREDUCEPD x,x/m128,imm8`	`VREDUCESD x,x,x/m64,imm8`	SAE
Floating-point classification test. imm8 specifies a set of floating-point number classes to test for as a bitmap.^[k] Result is written to opmask register — for each lane, if the input value belonged to any of the classes indicated by the imm8 argument, then a 1 is written to the corresponding opmask bit, else a 0 is written to that bit.	DQ	`EVEX.66.0F3A (66/67) /r ib`	`VFPCLASSPS k,x/m128,imm8`	`VFPCLASSSS k,x/m32,imm8`	`VFPCLASSPD k,x/m128,imm8`	`VFPCLASSSD k,x/m64,imm8`

[a]
For the VGETEXP* instructions:
If the $src1$ source argument is ±0, then the instruction returns $-\infty$ .

If the $src1$ source argument is $\pm \infty$ , then the instruction returns $+\infty$ .
[b]
The operation of the VRCP14* and VRSQRT14* approximation instructions is defined in a bit-exact manner − a C reference model is provided by Intel.^[12]

[c]

The normalization intervals supported for the VGETMANT* instructions are:

More information

...

Value	Normalization interval
0	$[1,2)$
1	$[0.5,2)$
2	$[0.5,1)$
3	$[0.75,1.5)$

For the

[0.5,2)

normalization interval, the input value will be scaled by a power of 4. If the input value is 0.0 or

\pm \infty

, then the result of VGETMANT* is 1.0, possibly modified by sign control.

[d]
For the VGETMANT* instructions, if bit 3 of its imm8 argument is set and the input value is negative ( $-0.0$ is not considered negative), then the result of VGETMANT* is converted to qNaN. Else if bit 2 is set, the sign bit of the result is cleared, else the input sign bit is kept.
[e]
The floating-point classes that the VFIXUPIMM* instructions will recognize are:
More information Number type, Classification index ...

Number type Classification
index

Quiet NaN 0

Signalling NaN 1

Zero (±0.0) 2

+1.0 3

Negative infinity 4

Positive infinity 5

Negative finite value 6

Positive finite value (other than +1.0) 7

Close
If MXCSR.DAZ is set, then denormals are classified as zero, else they are classified as positive/negative finite values.

Number type	Classification index
Quiet NaN	0
Signalling NaN	1
Zero (±0.0)	2
+1.0	3
Negative infinity	4
Positive infinity	5
Negative finite value	6
Positive finite value (other than +1.0)	7

[f]

The response values that can be selected by the VFIXUPIMM* instructions' 4-bit table lookup result are:

More information

...

Index	Response value
0	destination
1	source
2	QNaN based on source (all exponent-bits and top mantissa bit of source bitwise-ORed with 1s)
3	QNaN_Indefinite (sign, all exponent-bits and top mantissa bit set to 1s)
4	$+\infty$
5	$-\infty$
6	$\pm \infty$ , with same sign as source
7	$-0.0$
8	$+0.0$
9	$-1.0$
10	$+1.0$
11	$+0.5$
12	$+90.0$
13	$\pi /2$ (rounded to nearest)
14	MAX_FLOAT
15	-MAX_FLOAT

If MXCSR.DAZ is set, then any denormal source values (but not destination values) are flushed to zero before the response-value is selected. (This affects responses 1 and 2.)

[g]

The floating-point exceptions that can be indicated by the imm8 argument to the VFIXUPIMM* instruction are:

More information #ZE if

...

Bit	Exception to report
0	#ZE if $src1=0.0$
1	#IE if $src1=0.0$
2	#ZE if $src1=1.0$
3	#IE if $src1=1.0$
4	#IE if $src1=sNaN$
5	#IE if $src1=-\infty$
6	#IE if src1 is negative finite value
7	#IE if $src1=+\infty$

Denormal source values are treated as 0.0 if MXCSR.DAZ is set and as positive/negative finite values otherwise.

[h]

For the VRANGE* instructions, bits 1:0 of the instruction's imm8 argument will pick a comparison and selection to perform as follows:

More information Value, Meaning ...

Value	Meaning
0	Minimum value: src1 ≤ src2 ? src1 : src2
1	Maximum value: src1 ≤ src2 ? src2 : src1
2	Minimum absolute value: \|src1\| ≤ \|src2\| ? src1 : src2
3	Maximum absolute value: \|src1\| ≤ \|src2\| ? src2 : src1

If both source arguments are

\pm 0.0

, then for the purposes of this comparison and selection,

-0.0

is considered less than

+0.0

. If either input is

NaN

, then the result is

NaN

[i]
For the VRANGE* instructions, bits 3:2 of the imm8 are used to select the sign-bit of the result as follows:
More information Value, Meaning ...

Value Meaning

0 Use sign-bit from src1

1 Use sign-bit from selected value

2 Set sign-bit to 0

3 Set sign-bit to 1

Close
[j]
If the source argument to the VREDUCE* is $\pm \infty$ , then the result will be $+0.0$ .

Value	Meaning
0	Use sign-bit from src1
1	Use sign-bit from selected value
2	Set sign-bit to 0
3	Set sign-bit to 1

[k]

The imm8 argument to the VFPCLASS* instructions has the following layout:

More information

...

Bit	Classification to test
0	qNaN
1	$+0.0$
2	$-0.0$
3	$+\infty$
4	$-\infty$
5	Denormal
6	Negative finite
7	sNaN

If MXCSR.DAZ is set, then denormals are treated as ±0 for the classification test.

Opmask instructions

AVX-512 introduces, in addition to 512-bit vectors, a set of eight opmask registers, named k0,k1,k2...k7. These registers are 64 bits wide in implementations that support AVX512BW and 16 bits wide otherwise. They are mainly used to enable/disable operation on a per-lane basis for most of the AVX-512 vector instructions. They are usually set with vector-compare instructions or instructions that otherwise produce a 1-bit per-lane result as a natural part of their operation - however, AVX-512 defines a set of 55 new instructions to help assist manual manipulation of the opmask registers.

These instructions are, for the most part, defined in groups of 4 instructions, where the four instructions in a group are basically just 8-bit, 16-bit, 32-bit and 64-bit variants of the same basic operation (where only the low 8/16/32/64 bits of the registers participate in the given operation and, if a result is written back to a register, all bits except the bottom 8/16/32/64 bits are set to zero). The opmask instructions are all encoded with the VEX prefix (unlike all other AVX-512 instructions, which are encoded with the EVEX prefix).

In general, the 16-bit variants of the instructions are introduced by AVX512F (except KADDW and KTESTW), the 8-bit variants by the AVX512DQ extension, and the 32/64-bit variants by the AVX512BW extension.

Most of the instructions follow a very regular encoding pattern where the four instructions in a group have identical encodings except for the VEX.pp and VEX.W fields:

More information Instruction description, Basic opcode ...

Instruction description	Basic opcode	8-bit instructions (AVX512DQ) encoded with VEX.66.W0	16-bit instructions (AVX512F) encoded with VEX.NP.W0	32-bit instructions (AVX512BW) encoded with VEX.66.W1	64-bit instructions (AVX512BW) encoded with VEX.NP.W1
Bitwise AND between two opmask-registers	`VEX.L1.0F 41 /r`	`KANDB k,k,k`	`KANDW k,k,k`	`KANDD k,k,k`	`KANDQ k,k,k`
Bitwise AND-NOT between two opmask-registers	`VEX.L1.0F 42 /r`	`KANDNB k,k,k`	`KANDNW k,k,k`	`KANDND k,k,k`	`KANDNQ k,k,k`
Bitwise NOT of opmask-register	`VEX.L0.0F 44 /r`	`KNOTB k,k`	`KNOTW k,k`	`KNOTD k,k`	`KNOTQ k,k`
Bitwise OR of two opmask-registers	`VEX.L1.0F 45 /r`	`KORB k,k,k`	`KORW k,k,k`	`KORD k,k,k`	`KORQ k,k,k`
Bitwise XNOR of two opmask-registers	`VEX.L1.0F 46 /r`	`KXNORB k,k,k`	`KXNORW k,k,k`	`KXNORD k,k,k`	`KXNORQ k,k,k`
Bitwise XOR of two opmask-registers	`VEX.L1.0F 47 /r`	`KXORB k,k,k`	`KXORW k,k,k`	`KXORD k,k,k`	`KXORQ k,k,k`
Integer addition of two opmask-registers	`VEX.L1.0F 4A /r`	`KADDB k,k,k`	`KADDW k,k,k`^[a]	`KADDD k,k,k`	`KADDQ k,k,k`
Load opmask-register from memory or opmask-register	`VEX.L0.0F 90 /r`^[b]	`KMOVB k,k/m8`	`KMOVW k,k/m16`	`KMOVD k,k/m32`	`KMOVQ k,k/m64`
Store opmask-register to memory	`VEX.L0.0F 91 /r`^[b]	`KMOVB m8,k`	`KMOVW m16,k`	`KMOVD m32,k`	`KMOVQ m64,k`
Load opmask-register from general-purpose register	`VEX.L0.0F 92 /r`^[b]	`KMOVB k,r32`	`KMOVW k,r32`	No^[c]	No^[c]
Store opmask-register to general-purpose register with zero-extension	`VEX.L0.0F 93 /r`^[b]	`KMOVB r32,k`	`KMOVW r32,k`	No^[c]	No^[c]
Bitwise OR-and-test. Performs bitwise-OR between two opmask-registers and set flags accordingly. If the bitwise-OR resulted in all-0s, set ZF=1, else set ZF=0. If the bitwise-OR resulted in all-1s, set CF=1, else set CF=0.	`VEX.L0.0F 98 /r`	`KORTESTB k,k`	`KORTESTW k,k`	`KORTESTD k,k`	`KORTESTQ k,k`
Bitwise test. Performs bitwise-AND and ANDNOT between two opmask-registers and set flags accordingly. If the bitwise-AND resulted in all-0s, set ZF=1, else set ZF=0. If the bitwise AND between the inverted first operand and the second operand resulted in all-0s, set CF=1, else set CF=0.	`VEX.L0.0F 99 /r`	`KTESTB k,k`	`KTESTW k,k`^[a]	`KTESTD k,k`	`KTESTQ k,k`

[a]
The 16-bit opmask instructions KADDW and KTESTW were introduced with AVX512DQ, not AVX512F.
[b]
On processors that support Intel APX, all forms of the KMOV* instructions (but not any other opmask instructions) can be EVEX-encoded.
[c]
The 32/64-bit KMOVD/KMOVQ instructions to move between opmask-registers and general-purpose registers do exist, but do not match the pattern of the opcodes in this table. See table below.

Not all of the opmask instructions fit the pattern above - the remaining ones are:

More information Instruction description, Instruction mnemonics ...

Instruction description	Instruction mnemonics	Opcode	Operation/result width	AVX-512 subset
Opmask-register shift right immediate with zero-fill^[a]	`KSHIFTRB k,k,imm8`	`VEX.L0.66.0F3A.W0 30 /r /ib`	8	DQ
`KSHIFTRW k,k,imm8`	`VEX.L0.66.0F3A.W1 30 /r /ib`	16	F
`KSHIFTRD k,k,imm8`	`VEX.L0.66.0F3A.W0 31 /r /ib`	32	BW
`KSHIFTRQ k,k,imm8`	`VEX.L0.66.0F3A.W1 31 /r /ib`	64	BW
Opmask-register shift left immediate^[a]	`KSHIFTLB k,k,imm8`	`VEX.L0.66.0F3A.W0 32 /r /ib`	8	DQ
`KSHIFTLW k,k,imm8`	`VEX.L0.66.0F3A.W1 32 /r /ib`	16	F
`KSHIFTLD k,k,imm8`	`VEX.L0.66.0F3A.W0 33 /r /ib`	32	BW
`KSHIFTLQ k,k,imm8`	`VEX.L0.66.0F3A.W1 33 /r /ib`	64	BW
32/64-bit move between general-purpose registers and opmask registers	`KMOVD k,r32`	`VEX.L0.F2.0F.W0 92 /r`^[b]	32	BW
`KMOVQ k,r64`^[c]	`VEX.L0.F2.0F.W1 92 /r`^[b]	64	BW
`KMOVD r32,k`	`VEX.L0.F2.0F.W0 93 /r`^[b]	32	BW
`KMOVQ r64,k`^[c]	`VEX.L0.F2.0F.W1 93 /r`^[b]	64	BW
Concatenate two 8-bit opmasks into a 16-bit opmask	`KUNPCKBW k,k,k`	`VEX.L1.66.0F.W0 4B /r`	16	F
Concatenate two 16-bit opmasks into a 32-bit opmask	`KUNPCKWD k,k,k`	`VEX.L1.0F.W0 4B /r`	32	BW
Concatenate two 32-bit opmasks into a 64-bit opmask	`KUNPCKDQ k,k,k`	`VEX.L1.0F.W1 4B /r`	64	BW

[a]
For the KSHIFT* instructions, the imm8 shift-amount is not masked. Specifying a shift-amount greater than or equal to the operand size will produce an all-zeroes result.
[b]
On processors that support Intel APX, all forms of the KMOV* instructions (but not any other opmask instructions) can be EVEX-encoded.
[c]
KMOVQ instruction with 64-bit general-purpose register operand only available in x86-64 long mode. Instruction will execute as KMOVD in 32-bit mode.

Compare, test, blend, opmask-convert instructions

Vector-register instructions that use opmasks in ways other than just as a result writeback mask.

More information Description, Instruction mnemonics ...

Description		Instruction mnemonics	Opcode	AVX-512 subset	opmask lane-width	broadcast lane-width
Packed integer compare
Compare packed signed integers into opmask register	8-bit	`VPCMPB k,xmm,xmm/m128,imm8`	`EVEX.66.0F3A.W0 3F /r ib`	BW	8	No
	16-bit	`VPCMPW k,xmm,xmm/m128,imm8`	`EVEX.66.0F3A.W1 3F /r ib`	BW	16	No
	32-bit	`VPCMPD k,xmm,xmm/m128,imm8`	`EVEX.66.0F3A.W0 1F /r ib`	F	32	32
	64-bit	`VPCMPQ k,xmm,xmm/m128,imm8`	`EVEX.66.0F3A.W1 1F /r ib`	F	64	64
Compare packed unsigned integers into opmask register	8-bit	`VPCMPUB k,xmm,xmm/m128,imm8`	`EVEX.66.0F3A.W0 3E /r ib`	BW	8	No
	16-bit	`VPCMPUW k,xmm,xmm/m128,imm8`	`EVEX.66.0F3A.W1 3E /r ib`	BW	16	No
	32-bit	`VPCMPUD k,xmm,xmm/m128,imm8`	`EVEX.66.0F3A.W0 1E /r ib`	F	32	32
	64-bit	`VPCMPUQ k,xmm,xmm/m128,imm8`	`EVEX.66.0F3A.W1 1E /r ib`	F	64	64
Packed integer test
Perform bitwise-AND on packed integer values, then write zero/nonzero status of each integer result element into the corresponding opmask register bit (zero→0, nonzero→1)	8-bit	`VPTESTMB k,xmm,xmm/m128`	`EVEX.66.0F38.W0 26 /r`	BW	8	No
	16-bit	`VPTESTMW k,xmm,xmm/m128`	`EVEX.66.0F38.W1 26 /r`	BW	16	No
	32-bit	`VPTESTMD k,xmm,xmm/m128`	`EVEX.66.0F38.W0 27 /r`	F	32	32
	64-bit	`VPTESTMQ k,xmm,xmm/m128`	`EVEX.66.0F38.W1 27 /r`	F	64	64
Perform bitwise-AND on packed integer values, then write negated zero/nonzero status of each integer result element into the corresponding opmask register bit (zero→1, nonzero→0)	8-bit	`VPTESTNMB k,xmm,xmm/m128`	`EVEX.F3.0F38.W0 26 /r`	BW	8	No
	16-bit	`VPTESTNMW k,xmm,xmm/m128`	`EVEX.F3.0F38.W1 26 /r`	BW	16	No
	32-bit	`VPTESTNMD k,xmm,xmm/m128`	`EVEX.F3.0F38.W0 27 /r`	F	32	32
	64-bit	`VPTESTNMQ k,xmm,xmm/m128`	`EVEX.F3.0F38.W1 27 /r`	F	64	64
Packed blend
Variable blend packed integer values. For each lane of result, pick value from either second or third vector argument based on the opmask register.	8-bit	`VPBLENDMB xmm{k},xmm,xmm/m128`^[a]	`EVEX.66.0F38.W0 66 /r`	BW	No	No
	16-bit	`VPBLENDMW xmm{k},xmm,xmm/m128`^[a]	`EVEX.66.0F38.W1 66 /r`	BW	No	No
	32-bit	`VPBLENDMD xmm{k},xmm,xmm/m128`^[a]	`EVEX.66.0F38.W0 64 /r`	F	No	32
	64-bit	`VPBLENDMQ xmm{k},xmm,xmm/m128`^[a]	`EVEX.66.0F38.W1 64 /r`	F	No	64
Variable blend packed floating-point values. For each lane of result, pick value from either second or third vector argument based on the opmask register.	32-bit	`VBLENDMPS xmm{k},xmm,xmm/m128`^[a]	`EVEX.66.0F38.W0 65 /r`	F	No	32
	64-bit	`VBLENDMPD xmm{k},xmm,xmm/m128`^[a]	`EVEX.66.0F38.W1 65 /r`	F	No	64
Conversions between vector register and opmask register
Convert opmask register to vector register, with each vector lane set to 0 or all-1s based on corresponding opmask bit.	8-bit	`VPMOVM2B xmm,k`^[b]	`EVEX.F3.0F38.W0 28 /r`	BW	No	No
	16-bit	`VPMOVM2W xmm,k`^[b]	`EVEX.F3.0F38.W1 28 /r`	BW	No	No
	32-bit	`VPMOVM2D xmm,k`^[b]	`EVEX.F3.0F38.W0 38 /r`	DQ	No	No
	64-bit	`VPMOVM2Q xmm,k`^[b]	`EVEX.F3.0F38.W1 38 /r`	DQ	No	No
Convert vector register to opmask register, by picking the top bit of each vector register lane.	8-bit	`VPMOVB2M k,xmm`^[b]	`EVEX.F3.0F38.W0 29 /r`	BW	No	No
	16-bit	`VPMOVW2M k,xmm`^[b]	`EVEX.F3.0F38.W1 29 /r`	BW	No	No
	32-bit	`VPMOVD2M k,xmm`^[b]	`EVEX.F3.0F38.W0 39 /r`	DQ	No	No
	64-bit	`VPMOVQ2M k,xmm`^[b]	`EVEX.F3.0F38.W1 39 /r`	DQ	No	No

[a]
For the AVX-512 V(P)BLENDM* instructions, result write masking is not available - the EVEX-prefix opmask register argument that is normally used for write-masking with most other AVX-512 instructions is instead used for source selection.
[b]
The VPMOVM2* and VPMOV*2M instructions do not support result masking by EVEX-encoded opmask register, requiring EVEX.aaa=0. The opmask registers actually specified for these instructions is specified through the ModR/M byte.

Data conversion instructions

More information Instruction description, Instruction mnemonics ...

Instruction description		Instruction mnemonics	Opcode	AVX-512 subset	opmask lane-width	broadcast lane-width	rc/sae
Packed integer narrowing conversions
Convert packed integers to narrower integers, with unsigned saturation	16-bit → 8-bit	`VPMOVUSWB ymm/m256,zmm`	`EVEX.F3.0F38.W0 10 /r`	BW	8	No
	32-bit → 8-bit	`VPMOVUSDB xmm/m128,zmm`	`EVEX.F3.0F38.W0 11 /r`	F	8	No
	64-bit → 8-bit	`VPMOVUSQB xmm/m64,zmm`	`EVEX.F3.0F38.W0 12 /r`	F	8	No
	32-bit → 16-bit	`VPMOVUSDW ymm/m256,zmm`	`EVEX.F3.0F38.W0 13 /r`	F	16	No
	64-bit → 16-bit	`VPMOVUSQW xmm/m128,zmm`	`EVEX.F3.0F38.W0 14/r`	F	16	No
	64-bit → 32-bit	`VPMOVUSQD ymm/m256,zmm`	`EVEX.F3.0F38.W0 15 /r`	F	32	No
Convert packed integers to narrower integers, with signed saturation	16-bit → 8-bit	`VPMOVSWB ymm/m256,zmm`	`EVEX.F3.0F38.W0 20 /r`	BW	8	No
	32-bit → 8-bit	`VPMOVSDB xmm/m128,zmm`	`EVEX.F3.0F38.W0 21 /r`	F	8	No
	64-bit → 8-bit	`VPMOVSQB xmm/m64,zmm`	`EVEX.F3.0F38.W0 22 /r`	F	8	No
	32-bit → 16-bit	`VPMOVSDW ymm/m256,zmm`	`EVEX.F3.0F38.W0 23 /r`	F	16	No
	64-bit → 16-bit	`VPMOVSQW xmm/m128,zmm`	`EVEX.F3.0F38.W0 24 /r`	F	16	No
	64-bit → 32-bit	`VPMOVSQD ymm/m256,zmm`	`EVEX.F3.0F38.W0 25 /r`	F	32	No
Convert packed integers to narrower integers, with truncation	16-bit → 8-bit	`VPMOVWB ymm/m256,zmm`	`EVEX.F3.0F38.W0 30 /r`	BW	8	No
	32-bit → 8-bit	`VPMOVDB xmm/m128,zmm`	`EVEX.F3.0F38.W0 31 /r`	F	8	No
	64-bit → 8-bit	`VPMOVQB xmm/m64,zmm`	`EVEX.F3.0F38.W0 32 /r`	F	8	No
	32-bit → 16-bit	`VPMOVDW ymm/m256,zmm`	`EVEX.F3.0F38.W0 33 /r`	F	16	No
	64-bit → 16-bit	`VPMOVQW xmm/m128,zmm`	`EVEX.F3.0F38.W0 34 /r`	F	16	No
	64-bit → 32-bit	`VPMOVQD ymm/m256,zmm`	`EVEX.F3.0F38.W0 35 /r`	F	32	No
Packed conversions between floating-point and integer
Convert packed floating-point values to packed unsigned integers	FP32 → uint32	`VCVTPS2UDQ xmm,xmm/m128`	`EVEX.0F.W0 79 /r`	F	32	32	RC
	FP64 → uint32	`VCVTPD2UDQ xmm,xmm/m128`	`EVEX.0F.W1 79 /r`	F	32	64	RC
	FP32 → uint64	`VCVTPS2UQQ xmm,xmm/m64`	`EVEX.66.0F.W0 79 /r`	DQ	64	32	RC
	FP64 → uint64	`VCVTPD2UQQ xmm,xmm/m128`	`EVEX.66.0F.W1 79 /r`	DQ	64	64	RC
Convert packed floating-point values to packed signed integers	FP32 → int64	`VCVTPS2QQ xmm,xmm/m64`	`EVEX.66.0F.W0 7B /r`	DQ	64	32	RC
	FP64 → int64	`VCVTPD2QQ xmm,xmm/m128`	`EVEX.66.0F.W1 7B /r`	DQ	64	64	RC
Convert packed floating-point values to packed unsigned integers, with round-to-zero	FP32 → uint32	`VCVTTPS2UDQ xmm,xmm/m128`	`EVEX.0F.W0 78 /r`	F	32	32	SAE
	FP64 → uint32	`VCVTTPD2UDQ xmm,xmm/m128`	`EVEX.0F.W1 78 /r`	F	32	64	SAE
	FP32 → uint64	`VCVTTPS2UQQ xmm,xmm/m64`	`EVEX.66.0F.W0 78 /r`	DQ	64	32	SAE
	FP64 → uint64	`VCVTTPD2UQQ xmm,xmm/m128`	`EVEX.66.0F.W1 78 /r`	DQ	64	64	SAE
Convert packed floating-point values to packed signed integers, with round-to-zero	FP32 → int64	`VCVTTPS2QQ xmm,xmm/m64`	`EVEX.66.0F.W0 7A /r`	DQ	64	32	SAE
	FP64 → int64	`VCVTTPD2QQ xmm,xmm/m128`	`EVEX.66.0F.W1 7A /r`	DQ	64	64	SAE
Convert packed unsigned integers to floating-point	uint32 → FP32	`VCVTUDQ2PS xmm,xmm/m128`	`EVEX.F2.0F.W0 7A /r`	F	32	32	RC
	uint32 → FP64	`VCVTUDQ2PD xmm,xmm/m128`	`EVEX.F3.0F.W0 7A /r`	F	64	32	RC^[a]
	uint64 → FP32	`VCVTUQQ2PS xmm,xmm/m128`	`EVEX.F2.0F.W1 7A /r`	DQ	32	64	RC
	uint64 → FP64	`VCVTUQQ2PD xmm,xmm/m128`	`EVEX.F3.0F.W1 7A /r`	DQ	64	64	RC
Covert packed signed integers to floating-point	int64 → FP32	`VCVTQQ2PS xmm,xmm/m128`	`EVEX.0F.W1 5B /r`	DQ	32	64	RC
Covert packed signed integers to floating-point	int64 → FP64	`VCVTQQ2PD xmm,xmm/m128`	`EVEX.F3.0F.W1 E6 /r`	DQ	64	64	RC
Scalar conversions between floating-point and unsigned integer
Convert scalar floating-point value to unsigned integer, and store integer in GPR.	FP32 → uint32	`VCVTSS2USI r32,xmm/m32`	`EVEX.F3.0F.W0 79 /r`	F	No	No	RC
	FP32 → uint64	`VCVTSS2USI r64,xmm/m32`^[b]	`EVEX.F3.0F.W1 79 /r`	F	No	No	RC
	FP64 → uint32	`VCVTSD2USI r32,xmm/m64`	`EVEX.F2.0F.W0 79 /r`	F	No	No	RC
	FP64 → uint64	`VCVTSD2USI r64,xmm/m64`^[b]	`EVEX.F2.0F.W1 79 /r`	F	No	No	RC
Convert scalar floating-point value to unsigned integer with round-to-zero, and store integer in GPR.	FP32 → uint32	`VCVTTSS2USI r32,xmm/m32`	`EVEX.F3.0F.W0 78 /r`	F	No	No	SAE
	FP32 → uint64	`VCVTTSS2USI r64,xmm/m32`^[b]	`EVEX.F3.0F.W1 78 /r`	F	No	No	SAE
	FP64 → uint32	`VCVTTSD2USI r32,xmm/m64`	`EVEX.F2.0F.W0 78 /r`	F	No	No	SAE
	FP64 → uint32	`VCVTTSD2USI r64,xmm/m64`^[b]	`EVEX.F2.0F.W1 78 /r`	F	No	No	SAE
Convert scalar unsigned integer to floating-point	uint32 → FP32	`VCVTUSI2SS xmm,xmm,r/m32`	`EVEX.F3.0F.W0 7B /r`	F	No	No	RC
	uint64 → FP32	`VCVTUSI2SS xmm,xmm,r/m64`^[b]	`EVEX.F3.0F.W1 7B /r`	F	No	No	RC
	uint32 → FP64	`VCVTUSI2SD xmm,xmm,r/m32`	`EVEX.F2.0F.W0 7B /r`	F	No	No	RC^[a]
	uint64 → FP64	`VCVTUSI2SD xmm,xmm,r/m64`^[b]	`EVEX.F2.0F.W1 7B /r`	F	No	No	RC

[a]
For instructions that perform conversions from unsigned 32-bit integer to FP64 (VCVTUDQ2PD and the W=0 variant of VCVTUSI2SD), EVEX embedded rounding controls are permitted but have no effect.
[b]
Scalar conversions to/from 64-bit integer (VCVTSS2USI, VCVTSD2USI, VCVTTSS2USI, VCVTTSD2USI, VCVTUSI2SS, VCVTUSI2SD with EVEX.W=1 encoding) are only available in 64-bit "long mode". Otherwise, these instructions execute as if EVEX.W=0, resulting in 32-bit integer operation.

Data movement instructions

More information Instruction description, Instruction mnemonics ...

Instruction description		Instruction mnemonics	Opcode	AVX-512 subset	opmask lane-width	broadcast lane-width
Vector lane broadcast/extract/insert
Broadcast general-purpose register to all lanes of vector register^[a]	8-bit	`VPBROADCASTB xmm,r32`	`EVEX.66.0F38.W0 7A /r`	BW	8	No
	16-bit	`VPBROADCASTW xmm,r32`	`EVEX.66.0F38.W0 7B /r`	BW	16	No
	32-bit	`VPBROADCASTD xmm,r32`	`EVEX.66.0F38.W0 7C /r`	F	32	No
	64-bit	`VPBROADCASTQ xmm,r64`^[b]	`EVEX.66.0F38.W1 7C /r`	F	64	No
Broadcast vector of 256-bit floating-point data from memory to all 256-bit lanes of zmm-register	fp32	`VBROADCASTF32X8 zmm,m256`	`EVEX.66.0F38.W0 1B /r`	DQ	32	(256)^[c]
	fp64	`VBROADCASTF64X4 zmm,m256`	`EVEX.66.0F38.W0 1B /r`	F	64	(256)^[c]
Extract 256-bit vector-lane of floating-point data from wider vector-register	fp32	`VEXTRACTF32X8 ymm/m256,zmm`	`EVEX.66.0F3A.W0 1B /r ib`	DQ	32	No
	fp64	`VEXTRACTF64X4 ymm/m256,zmm`	`EVEX.66.0F3A.W1 1B /r ib`	F	64	No
Insert 256-bit vector of floating-point data into lane of wider vector	fp32	`VINSERTF32X8 zmm,zmm,ymm/m256,imm8`	`EVEX.66.0F3A.W0 1A /r ib`	DQ	32	No
	fp64	`VINSERTF64X4 zmm,zmm,ymm/m256,imm8`	`EVEX.66.0F3A.W1 1A /r ib`	F	64	No
Broadcast vector of 256-bit integer-point data from memory to all 256-bit lanes of zmm-register	i32	`VBROADCASTI32X8 zmm,m256`	`EVEX.66.0F38.W0 5B /r`	DQ	32	(256)^[c]
	i64	`VBROADCASTI64X4 zmm,m256`	`EVEX.66.0F38.W1 5B /r`	F	64	(256)^[c]
Extract 256-bit vector-lane of integer data from wider vector-register	i32	`VEXTRACTI32X8 ymm/m256,zmm`	`EVEX.66.0F3A.W0 3B /r ib`	DQ	32	No
	i64	`VEXTRACTI64X4 ymm/m256,zmm`	`EVEX.66.0F3A.W1 3B /r ib`	F	64	No
Insert 256-bit vector of integer data into lane of wider vector	i32	`VINSERTI32X8 zmm,zmm,ymm/m256,imm8`	`EVEX.66.0F3A.W0 3A /r ib`	DQ	32	No
	i64	`VINSERTI64X4 zmm,zmm,ymm/m256,imm8`	`EVEX.66.0F3A.W0 3A /r ib`	F	64	No
Vector compress/expand
Store sparse packed data items into a densely-packed array in vector register or memory. The data items to pack together are selected with an opmask register.	i32	`VPCOMPRESSD zmm1/m512{k1},zmm2`	`EVEX.66.0F38.W0 8B /r`	F	32	No
	i64	`VPCOMPRESSQ zmm1/m512{k1},zmm2`	`EVEX.66.0F38.W1 8B /r`	F	64	No
	fp32	`VCOMPRESSPS zmm1/m512{k1},zmm2`	`EVEX.66.0F38.W0 8A /r`	F	32	No
	fp64	`VCOMPRESSPD zmm1/m512{k1},zmm2`	`EVEX.66.0F38.W1 8A /r`	F	64	No
Load densely-packed data from vector register or memory, and store into sparse vector-elements of destination register. The data items to unpack into are selected with an opmask register.	i32	`VPEXPANDD zmm1{k1},zmm2/m512`	`EVEX.66.0F38.W0 89 /r`	F	32	No
	i64	`VPEXPANDQ zmm1{k1},zmm2/m512`	`EVEX.66.0F38.W1 89 /r`	F	64	No
	fp32	`VEXPANDPS zmm1{k1},zmm2/m512`	`EVEX.66.0F38.W0 88 /r`	F	32	No
	fp64	`VEXPANDPD zmm1{k1},zmm2/m512`	`EVEX.66.0F38.W1 88 /r`	F	64	No
Vector shuffle/permute
Permute 16-bit data-items from zmm3/m512 using 16-bit indexes from zmm2, then store result in zmm1.		`VPERMW zmm1,zmm2,zmm3/m512`	`EVEX.66.0F38.W1 8D /r`	BW	16	No
Concatenate first and third argument into a table the size of two vector registers (first argument being bottom half of table), then for each lane, perform an indexed lookup into the table using indexes from the second argument. The first argument (low half of table) is overwritten.	i16→i16	`VPERMT2W zmm1,zmm2,zmm3/m512`	`EVEX.66.0F38.W1 7D /r`	BW	16	No
	i32→i32	`VPERMT2D zmm1,zmm2,zmm3/m512`	`EVEX.66.0F38.W0 7E /r`	F	32	32
	i64→i64	`VPERMT2Q zmm1,zmm2,zmm3/m512`	`EVEX.66.0F38.W1 7E /r`	F	64	64
	i32→fp32	`VPERMT2PS zmm1,zmm2,zmm3/m512`	`EVEX.66.0F38.W0 7F /r`	F	32	32
	i64→fp64	`VPERMT2PD zmm1,zmm2,zmm3/m512`	`EVEX.66.0F38.W1 7F /r`	F	64	64
Concatenate second and third argument into a table the size of two vector registers (first argument being bottom half of table), then for each lane, perform an indexed lookup into the table using indexes from the first argument. The first argument (indexes) is overwritten.	i16→i16	`VPERMI2W zmm1,zmm2,zmm3/m512`	`EVEX.66.0F38.W1 75 /r`	BW	16	No
	i32→i32	`VPERMI2D zmm1,zmm2,zmm3/m512`	`EVEX.66.0F38.W0 76 /r`	F	32	32
	i64→i64	`VPERMI2Q zmm1,zmm2,zmm3/m512`	`EVEX.66.0F38.W1 76 /r`	F	32	32
	i32→fp32	`VPERMI2PS zmm1,zmm2,zmm3/m512`	`EVEX.66.0F38.W0 77 /r`	F	64	64
	i64→fp64	`VPERMI2PD zmm1,zmm2,zmm3/m512`	`EVEX.66.0F38.W0 77 /r`	F	64	64
Packed Interleaved Shuffle on 128-bit vector components. Performs a shuffle on each of its two input arguments, then keeps the bottom half of the shuffle result from its first argument and the top half of the shuffle result from its second argument. (Each half-result is one 128-bit component when using a vector size of 256 bits and two 128-bit components when using a vector size of 512 bits.) (For the purposes of masking and broadcast, the 128-bit components are themselves considered vectors of 32-bit or 64-bit items.)	4x fp32	`VSHUFF32x4 zmm1{k1},zmm2,zmm3/m512,imm8`	`EVEX.66.0F3A.W0 23 /r ib`	F	32	32
	2x fp64	`VSHUFF64x2 zmm1{k1},zmm2,zmm3/m512,imm8`	`EVEX.66.0F3A.W1 23 /r ib`	F	64	64
	4x int32	`VSHUFI32x4 zmm1{k1},zmm2,zmm3/m512,imm8`	`EVEX.66.0F3A.W0 43 /r ib`	F	32	32
	2x int64	`VSHUFI64x2 zmm1{k1},zmm2,zmm3/m512,imm8`	`EVEX.66.0F3A.W1 43 /r ib`	F	64	64
Vector scatter
Conditional vector memory scatter. For each 32/64-bit component of a given vector register, treat the component as a signed index for the x86 SIB `base+scale*index+displacement` address calculation, then conditionally store a 32/64-bit data item to the computed memory address. For each lane, whether it is stored to memory or not is controlled by the specified opmask register — if the corresponding bit in the specified opmask register is 1, then the data item is stored to memory and the bit is cleared, so that if any of the stores causes a memory exception, the instruction can be restarted without needing to redo earlier stores.	[s32]←i32	`VPSCATTERDD vm32z{k1},zmm1`	`EVEX.66.0F38.W0 A0 /r /vsib`	F	32	No
	[s32]←i64	`VPSCATTERDQ vm32y{k1},zmm1`	`EVEX.66.0F38.W1 A0 /r /vsib`	F	64	No
	[s64]←i32	`VPSCATTERQD vm64z{k1},ymm1`	`EVEX.66.0F38.W0 A1 /r /vsib`	F	32	No
	[s64]←i64	`VPSCATTERQQ vm64z{k1},zmm1`	`EVEX.66.0F38.W1 A1 /r /vsib`	F	64	No
	[s32]←f32	`VSCATTERDPS vm32z{k1},zmm1`	`EVEX.66.0F38.W0 A2 /r /vsib`	F	32	No
	[s32]←f64	`VSCATTERDPD vm32y{k1},zmm1`	`EVEX.66.0F38.W1 A2 /r /vsib`	F	64	No
	[s64]←f32	`VSCATTERQPS vm64z{k1},ymm1`	`EVEX.66.0F38.W0 A2 /r /vsib`	F	32	No
	[s64]←f64	`VSCATTERQPD vm64z{k1},zmm1`	`EVEX.66.0F38.W1 A2 /r /vsib`	F	64	No

[a]
Variants of the VPBROADCAST* instructions that can take their source from memory or an XMM register were introduced in AVX2.
[b]
VPBROADCASTQ with 64-bit register source operand is available only in 64-bit long mode. In 32-bit mode, the instruction will execute as if EVEX.W=0, resulting in 32-bit operation.
[c]
The V(P)BROADCAST* instructions perform broadcast as part of their normal operation - under AVX-512 with EVEX prefix, they do not need the EVEX.b modifier.

Shift/rotate instructions

More information Instruction description, Instruction mnemonics ...

Instruction description	Instruction mnemonics	Opcode	AVX-512 subset	opmask lane-width	broadcast lane-width
Packed left-rotate	32-bit	`VPROLD z,z/m512,imm8`	`EVEX.66.0F.W0 72 /1 ib`	F	32	32
64-bit	`VPROLQ z,z/m512,imm8`	`EVEX.66.0F.W1 72 /1 ib`	F	64	64
Packed left-rotate with per-lane rotate-amount	32-bit	`VPROLVD z,z,z/m512`	`EVEX.66.0F38.W0 15 /r`	F	32	32
64-bit	`VPROLVQ z,z,z/m512`	`EVEX.66.0F38.W1 15 /r`	F	64	64
Packed right-rotate	32-bit	`VPRORD z,z/m512,imm8`	`EVEX.66.0F.W0 72 /0 ib`	F	32	32
64-bit	`VPRORQ z,z/m512,imm8`	`EVEX.66.0F.W1 72 /0 ib`	F	64	64
Packed right-rotate with per-lane rotate-amount	32-bit	`VPRORVD z,z,z/m512`	`EVEX.66.0F38.W0 14 /r`	F	32	32
64-bit	`VPRORVQ z,z,z/m512`	`EVEX.66.0F38.W1 14 /r`	F	64	64

Packed arithmetic right-shift with shift-amount taken from either immediate or bottom lane of vector register	64-bit	`VPSRAQ z,z,x/m128`	`EVEX.66.0F.W1 E2 /r`	F	64	64
`VPSRAQ z,z/m512,imm8`	`EVEX.66.0F.W1 72 /4 ib`	F	64	64
Packed arithmetic right-shift with per-lane shift-amount	16-bit	`VPSRAVW z,z,z/m512`	`EVEX.66.0F38.W1 11 /r`	BW	16	No
64-bit	`VPSRAVQ z,z,z/m512`	`EVEX.66.0F38.W1 46 /r`	F	64	No
Packed logical left-shift with per-lane shift-amount	16-bit	`VPSLLVW z,z,z/m512`	`EVEX.66.0F38.W1 12 /r`	BW	16	No
Packed logical right-shift with per-lane shift-amount	16-bit	`VPSRLVW z,z,z/m512`	`EVEX.66.0F38.W1 10 /r`	BW	16	No

Concatenate two input vectors into a double-width vector, then right-shift that vector. The shift-amount is given as an immediate, in units of 32 or 64 bits.	32-bit	`VALIGND z,z,z/m512,imm8`	`EVEX.66.0F3A.W0 03 /r ib`	F	32	32
64-bit	`VALIGNQ z,z,z/m512,imm8`	`EVEX.66.0F3A.W1 03 /r ib`	F	64	64

Other AVX-512 foundation instructions

More information Perform floating-point rounding to a multiple of

...

Instruction description		Instruction mnemonics	Opcode	AVX-512 subset	opmask lane-width	broadcast lane-width	rc/sae
Three-argument bitwise logical operation on packed integers, with 8-bit truth table provided as 8-bit immediate.	32-bit	`VPTERNLOGD z,z,z/m512,imm8`	`EVEX.66.0F3A.W0 25 /r ib`	F	32	32	No
	64-bit	`VPTERNLOGQ z,z,z/m512,imm8`	`EVEX.66.0F3A.W1 25 /r ib`	F	64	64	No
Packed 64-bit integer multiply, keeping low 64 bits		`VPMULLQ z,z,z/m512`	`EVEX.66.0F38.W1 40 /r`	DQ	64	64	No
Double block packed sum-of-absolute-differences on unsigned bytes		`VDBPSADBW z,z,z/m512,imm8`	`EVEX.66.0F3A.W0 42 /r ib`	BW	16	No	No
Perform floating-point rounding to a multiple of $2^{-M}$ , where $M$ is taken from bits 7:4 of imm8 argument and can take values in the range 0..15. Bits 1:0 of the imm8 are used to specify rounding mode; if bit 2 is set, then the rounding mode is instead taken from the MXCSR.	fp32 packed	`VRNDSCALEPS z,z/m512,imm8`	`EVEX.66.0F3A.W0 08 /r ib`	F	32	32	SAE
	fp32 scalar	`VRNDSCALESS x,x,x/m32,imm8`	`EVEX.66.0F3A.W0 0A /r ib`	F	32	No	SAE
	fp64 packed	`VRNDSCALEPD z,z,z/m512,imm8`	`EVEX.66.0F3A.W1 09 /r ib`	F	64	64	SAE
	fp64 scalar	`VRNDSCALESD x,x,x/m64,imm8`	`EVEX.66.0F3A.W1 0B /r ib`	F	32	No	SAE

Remove ads

AMX

Summarize

Perspective

Intel AMX adds eight new tile-registers, tmm0-tmm7, each holding a matrix, with a maximum capacity of 16 rows of 64 bytes per tile-register. It also adds a TILECFG register to configure the sizes of the actual matrices held in each of the eight tile-registers, and a set of instructions to perform matrix multiplications on these registers.

More information AMX subset, Instruction mnemonics ...

AMX subset	Instruction mnemonics	Opcode	Instruction description	Added in

AMX-TILE AMX control and tile management.	`LDTILECFG m512`	`VEX.128.NP.0F38.W0 49 /0`	Load AMX tile configuration data structure from memory as a 64-byte data structure.	Sapphire Rapids
	`STTILECFG m512`	`VEX.128.66.0F38 W0 49 /0`	Store AMX tile configuration data structure to memory.
	`TILERELEASE`	`VEX.128.NP.0F38.W0 49 C0`	Initialize `TILECFG` and tile data registers (`tmm0` to `tmm7`) to the INIT state (all-zeroes).
	`TILEZERO tmm`	`VEX.128.F2.0F38.W0 49 /r`^[a]	Zero out contents of one tile register.
	`TILELOADD tmm, sibmem`	`VEX.128.F2.0F38.W0 4B /r`^[b]	Load a data tile from memory into AMX tile register.
	`TILELOADDT1 tmm, sibmem`	`VEX.128.66.0F38.W0 4B /r`^[b]	Load a data tile from memory into AMX tile register, with a hint that data should not be kept in the nearest cache levels.
	`TILESTORED mem, sibtmm`	`VEX.128.F3.0F38.W0 4B /r`^[b]	Store a data tile to memory from AMX tile register.

AMX-INT8 Matrix multiplication of tiles, with source data interpreted as 8-bit integers and destination data accumulated as 32-bit integers.	`TDPBSSD tmm1,tmm2,tmm3`^[c]	`VEX.128.F2.0F38.W0 5E /r`	Matrix multiply signed bytes from tmm2 with signed bytes from tmm3, accumulating result in tmm1.
	`TDPBSUD tmm1,tmm2,tmm3`^[c]	`VEX.128.F3.0F38.W0 5E /r`	Matrix multiply signed bytes from tmm2 with unsigned bytes from tmm3, accumulating result in tmm1.
	`TDPBUSD tmm1,tmm2,tmm3`^[c]	`VEX.128.66.0F38.W0 5E /r`	Matrix multiply unsigned bytes from tmm2 with signed bytes from tmm3, accumulating result in tmm1.
	`TDPBUUD tmm1,tmm2,tmm3`^[c]	`VEX.128.NP.0F38.W0 5E /r`	Matrix multiply unsigned bytes from tmm2 with unsigned bytes from tmm3, accumulating result in tmm1.

AMX-BF16 Matrix multiplication of tiles, with source data interpreted as bfloat16 values, and destination data accumulated as FP32 floating-point values.	`TDPBF16PS tmm1,tmm2,tmm3`^[c]	`VEX.128.F3.0F38.W0 5C /r`	Matrix multiply BF16 values from tmm2 with BF16 values from tmm3, accumulating result in tmm1.

AMX-FP16 Matrix multiplication of tiles, with source data interpreted as FP16 values, and destination data accumulated as FP32 floating-point values.	`TDPFP16PS tmm1,tmm2,tmm3`^[c]	`VEX.128.F2.0F38.W0 5C /r`	Matrix multiply FP16 values from tmm2 with FP16 values from tmm3, accumulating result in tmm1.	(Granite Rapids)

AMX-COMPLEX Matrix multiplication of tiles, with source data interpreted as complex numbers represented as pairs of FP16 values, and destination data accumulated as FP32 floating-point values.	`TCMMRLFP16PS tmm1,tmm2,tmm3`^[c]	`VEX.128.NP.0F38.W0 6C /r`	Matrix multiply complex numbers from tmm2 with complex numbers from tmm3, accumulating real part of result in tmm1.	(Granite Rapids D)
	`TCMMILFP16PS tmm1,tmm2,tmm3`^[c]	`VEX.128.66.0F38.W0 6C /r`	Matrix multiply complex numbers from tmm2 with complex numbers from tmm3, accumulating imaginary part of result in tmm1.	(Granite Rapids D)

[a]
For TILEZERO, the tile-register to clear is specified by bits 5:3 of the instruction's ModR/M byte. Bits 7:6 must be set to 11b, and bits 2:0 must be set to 000b.
[b]
For the TILELOADD, TILELOADDT1 and TILESTORED instructions, the memory argument must use a memory addressing mode with the SIB-byte. Under this addressing mode, the base register and displacement are used to specify the starting address for the first row of the tile to load/store from/to memory – the scale and index are used to specify a per-row stride.
These instructions are all interruptible – an interrupt or memory exception taken in the middle of these instructions will cause progress tracking information to be written to TILECFG.start_row, so that the instruction may continue on a partially-loaded/stored tile after the interruption.
[c]
For all of the AMX matrix multiply instructions, the three arguments are required to be three different tile registers, or else the instruction will #UD.

Remove ads

References

Loading content...

External links

Loading content...

Loading related searches...

Wikiwand - on

Seamless Wikipedia browsing. On steroids.

Remove ads

Summary of SIMD extensions

MMX instructions and extended variants thereof

Original Pentium MMX instructions, and SSE2/AVX/AVX-512 extended variants thereof

MMX instructions added with MMX+/SSE/SSE2/SSSE3, and SSE2/AVX/AVX-512 extended variants thereof

SSE instructions and extended variants thereof

Regularly-encoded floating-point SSE/SSE2 instructions, and AVX/AVX-512 extended variants thereof

Integer SSE2/4 instructions with 66h prefix, and AVX/AVX-512 extended variants thereof

Other SSE/2/3/4 SIMD instructions, and AVX/AVX-512 extended variants thereof

AVX/AVX2 instructions, and AVX-512 extended variants thereof

AVX instructions

AVX2 instructions

Other VEX-encoded SIMD instructions

FMA3 and FMA4 instructions

AVX-512

AVX-512 foundation, byte/word and doubleword/quadword instructions (F, BW and DQ subsets)

Regularly-encoded floating-point instructions

Opmask instructions

Compare, test, blend, opmask-convert instructions

Data conversion instructions

Data movement instructions

Shift/rotate instructions

Other AVX-512 foundation instructions

AMX

See also

References

External links