AVX-512 Subversions

The AVX-512 instruction set, introduced with Intel's Knights Landing and Skylake processors, is implemented in different flavors (see the cpuinfo command). This page gives some deeper insight into these flavors.

AVX-512CD: Conflict Detection

The conflict detection instruction is available on Skylake and Knights Landing processors. It operates on a vector of indices (for example, offsets used for memory accesses) and reports, for each vector lane, which of the preceding lanes contain the same value. This instruction allows the vectorization of codes with indirect memory accesses that may lead to write conflicts.
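The result of the conflict detection can be modelled in plain scalar C. The following is a minimal sketch of the per-lane semantics for 16 32-bit indices, not the code the compiler generates; the names LANES, conflict_mask and has_conflict are chosen for illustration only.

```c
#include <stdint.h>

#define LANES 16 /* 16 x 32-bit elements fit into one 512-bit register */

/* Scalar model of per-lane conflict detection (illustrative names):
   bit j of out[i] is set when j < i and idx[j] == idx[i]. */
void conflict_mask(const int32_t idx[LANES], uint32_t out[LANES])
{
    for (int i = 0; i < LANES; i++) {
        out[i] = 0;
        for (int j = 0; j < i; j++) {
            if (idx[j] == idx[i])
                out[i] |= (uint32_t)1 << j;
        }
    }
}

/* Returns 1 if any lane repeats the index of an earlier lane, else 0. */
int has_conflict(const int32_t idx[LANES])
{
    uint32_t m[LANES];
    conflict_mask(idx, m);
    for (int i = 0; i < LANES; i++) {
        if (m[i] != 0)
            return 1;
    }
    return 0;
}
```

A block of indices for which has_conflict returns 0 can be gathered, updated and scattered as a whole; otherwise the conflicting lanes need separate treatment.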

For older instruction sets, the program in Listing 1 must be executed sequentially, i.e. one value of i after the other, because B[i] might have the same value for different values of i. With AVX-512CD, vectorization is possible in the steps shown in Listing 2.

Listing 1: Small program with conflicts.
First, a block of values from B is loaded into a vector register (vmovups, Line 1 in block B1.2), and then a conflict detection instruction (vpconflictd, Line 2) identifies entries with identical values. If there are no conflicts, the code jumps to block B1.6. Values that may be in conflict are handled in a loop in block B1.4. For non-conflicting values, the corresponding values of C are gathered (vgatherdps, Line 8) and the resulting vector A is scattered to memory (vscatterdps, Line 10).
Listing 2: Lines from assembly code for Listing 1.
This program is compiled with
icc -xCOMMON-AVX512 -S -qopt-report=5 indir.c
on a Skylake system. Listing 2 provides only the major steps in the assembly code. The generated optimization report contains the following lines:
irregularly indexed store was generated for the variable 
LOOP WAS VECTORIZED
--- begin vector cost summary ---
scalar cost: 10
vector cost: 9.780
estimated potential speedup: 1.020
--- end vector cost summary ---
Thus, for the program in Listing 1, at best a tiny speedup (an estimated factor of 1.02) can be expected. In practice, this feature matters for larger loops, where conflict detection allows the vectorization of a complete loop which would otherwise be executed sequentially.

AVX-512F: Foundation

The AVX-512F instruction set is fundamental to the vectorization possibilities. It contains several features described below.

Masking

With this instruction set, dedicated bitmask registers (k0 to k7) are available. Their purpose is the efficient vectorization of loops with if-statements like the one in Listing 3.
Listing 3: Small program with if-statement in loop.
Compiled with the options given above, a potential speedup of up to 13 is predicted this time, indicating a significant performance boost. The effect of this instruction set is again best seen in the assembly output, which is given in abbreviated form in Listing 4.
Listing 4: Abbreviated assembly listing for Listing 3.
First B is loaded into ZMM registers (lines 1 and 2), then A (lines 3 and 4). The instruction vcmpps evaluates the if-statement into mask registers (k2 and k4), and knotw computes their logical complements (lines 7 and 8). The registers containing B are copied according to the complemented masks (lines 9 and 10) and added to A (vaddps). The other B values are multiplied with the corresponding values according to the mask registers k2 and k4 (vmulps, lines 13 and 14) before the registers containing the results are stored completely (vmovups). The loop is unrolled by a factor of 2, therefore all instructions appear twice.
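The idea behind masking can be sketched in portable C. The example below is an assumption-laden model, not the original Listing 3: it assumes a hypothetical loop body of the form "if (B[i] > 0) A[i] += B[i]; else A[i] *= B[i];" and processes one block of 16 floats. The names compare_gt_mask and masked_update are illustrative.

```c
#include <stdint.h>

#define LANES 16 /* 16 floats per 512-bit register */

/* Build a 16-bit mask: bit l is set when b[l] > threshold.
   This models a vector compare that writes a mask register. */
uint16_t compare_gt_mask(const float b[LANES], float threshold)
{
    uint16_t k = 0;
    for (int l = 0; l < LANES; l++) {
        if (b[l] > threshold)
            k |= (uint16_t)(1u << l);
    }
    return k;
}

/* Evaluate both branches for every lane and select by mask bit:
   lanes with the bit set receive a + b, all other lanes a * b.
   The branch bodies are a hypothetical example. */
void masked_update(float a[LANES], const float b[LANES], uint16_t k)
{
    for (int l = 0; l < LANES; l++) {
        float sum = a[l] + b[l];
        float prod = a[l] * b[l];
        a[l] = ((k >> l) & 1u) ? sum : prod;
    }
}
```

Because the selection is driven by a bitmask rather than by a jump, every lane follows the same instruction stream, which is what makes such loops vectorizable.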

Compress/Expand

AVX-512F adds compress and expand instructions to AVX2. vcompress is used in the following example:
Listing 5: Compression of array B into array A.
Listing 6: Corresponding assembly snippets.
First B is loaded (vmovups) and compared into a mask register (k1). Guided by this mask, the valid values of B are compressed into another ZMM register (vcompressps) and the contents are finally written to memory. The predicted speedup amounts to a factor of 18, and the loop is unrolled by a factor of 4.
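The compress step can be modelled in scalar C as follows. This is a sketch under assumptions, not the original Listing 5: the selection condition (B[i] > 0) and the names compress_block and compress_positive are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

#define LANES 16 /* 16 floats per 512-bit register */

/* Model of a compress: lanes of src whose mask bit is set are packed
   contiguously into dst, in lane order. Returns the number written. */
size_t compress_block(float *dst, const float src[LANES], uint16_t k)
{
    size_t out = 0;
    for (int l = 0; l < LANES; l++) {
        if ((k >> l) & 1u)
            dst[out++] = src[l];
    }
    return out;
}

/* Compress all positive elements of b into a (a and b must not overlap).
   Full blocks use compare + compress, the remainder is handled
   element by element. Returns the number of elements stored in a. */
size_t compress_positive(float *a, const float *b, size_t n)
{
    size_t out = 0;
    size_t i = 0;
    for (; i + LANES <= n; i += LANES) {
        uint16_t k = 0;
        for (int l = 0; l < LANES; l++) {
            if (b[i + l] > 0.0f)
                k |= (uint16_t)(1u << l);
        }
        out += compress_block(a + out, b + i, k);
    }
    for (; i < n; i++) {
        if (b[i] > 0.0f)
            a[out++] = b[i];
    }
    return out;
}
```

The write position advances by the number of set mask bits per block, so the output stays densely packed regardless of how many lanes were selected.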

Permutation or Shuffle

These instructions rearrange elements of one or two source registers and write them to the destination register. They begin with vperm in the assembly code and may be used for a transposition, i.e.
A[i][j] = B[j][i]
Their efficiency is mediocre, and the example assembly code is too complicated to be presented here.
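The effect of a single-source permutation can nevertheless be illustrated with a small scalar model. The sketch below assumes a 4x4 matrix of floats stored row by row in the 16 lanes of one register; the names permute and transpose4x4_indices are illustrative, and real compiler output for larger matrices is considerably more involved.

```c
#include <stdint.h>

#define LANES 16 /* 16 floats per 512-bit register */

/* Model of a one-source permutation: lane l of dst receives the lane of
   src selected by the low 4 bits of idx[l]. dst and src must not alias. */
void permute(float dst[LANES], const float src[LANES], const int32_t idx[LANES])
{
    for (int l = 0; l < LANES; l++)
        dst[l] = src[idx[l] & (LANES - 1)];
}

/* Fill idx so that permute() transposes a row-major 4x4 matrix:
   destination element (i,j) is taken from source element (j,i). */
void transpose4x4_indices(int32_t idx[LANES])
{
    for (int i = 0; i < 4; i++) {
        for (int j = 0; j < 4; j++)
            idx[i * 4 + j] = j * 4 + i;
    }
}
```

The index vector is computed once and can then be reused for every block that needs the same rearrangement.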

Gather/Scatter

To vectorize non-unit strided access to arrays, like in
for(i=0;i<SIZE/INC;i++) A[i*INC] += C[i];
AVX-512 provides gather and scatter operations. Whether they are used depends on whether INC is known at compile time and on the size of INC. For INC=2 the compiler prefers to work on A with a series of shuffle and expand operations (see Listing 7). A stride of 2 puts 2 words of A onto one cache line. Guided by a mask register, 8 floats are read from 4 cache lines into one ZMM register, which is then half filled. Two half-filled ZMM registers are combined (line 1, vpermt2ps), and the operands C are fetched from memory and added (line 2, vaddps). Half of the values are expanded into a new register (vexpandps to ZMM6) and the other half is permuted into another register (ZMM8); both registers are then stored to 4 cache lines each. Note the large number of registers required in this case: neglecting ZMM1 and ZMM0, which hold the permutation indices, 5 additional registers are needed (2, 4, 7, 6, and 8).

1 vpermt2ps %zmm2, %zmm1, %zmm4
2 vaddps 32000(%rsp,%rax,4), %zmm4, %zmm7
3 vexpandps %zmm7, %zmm6{%k1}{z}
4 vpermps %zmm7, %zmm0, %zmm8
Listing 7: Shuffle and expand

1 vgatherdps (%rax,%zmm0,4), %zmm1{%k1}
2 vaddps 32000(%rsp,%rdx,4), %zmm1, %zmm3
3 vscatterdps %zmm3, (%rax,%zmm0,4){%k3}
Listing 8: Gather/Scatter
Still, the compiler decides that this is more effective than the code in Listing 8, which uses the gather/scatter operations of AVX-512F. There, data is first gathered directly from cache into one register (ZMM1, line 1), then 16 32-bit floats are added (line 2) and, according to the indices in ZMM0, scattered to memory (line 3, vscatterdps). In this case a total of just 3 vector registers is involved (0, 1, and 3), which opens the possibility of vectorizing more complicated loops.
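The three steps of the gather/scatter variant (gather, add, scatter) can be written down as a scalar model. This is a sketch of the semantics for one block of 16 elements of the strided loop above; the function names gather, scatter and strided_add_block are illustrative, and all lanes are enabled in the masks.

```c
#include <stdint.h>

#define LANES 16 /* 16 floats per 512-bit register */

/* Model of a masked gather: enabled lanes load base[idx[l]]. */
void gather(float dst[LANES], const float *base,
            const int32_t idx[LANES], uint16_t k)
{
    for (int l = 0; l < LANES; l++) {
        if ((k >> l) & 1u)
            dst[l] = base[idx[l]];
    }
}

/* Model of a masked scatter: enabled lanes store src[l] to base[idx[l]]. */
void scatter(float *base, const int32_t idx[LANES],
             const float src[LANES], uint16_t k)
{
    for (int l = 0; l < LANES; l++) {
        if ((k >> l) & 1u)
            base[idx[l]] = src[l];
    }
}

/* One block of A[i*inc] += C[i] for i = 0..15.
   a must hold at least 15*inc + 1 elements, c at least 16. */
void strided_add_block(float *a, const float *c, int32_t inc)
{
    int32_t idx[LANES];
    float v[LANES];
    for (int l = 0; l < LANES; l++)
        idx[l] = l * inc;          /* index vector, as held in ZMM0 */
    gather(v, a, idx, 0xFFFF);     /* step 1: gather 16 elements of A */
    for (int l = 0; l < LANES; l++)
        v[l] += c[l];              /* step 2: add 16 elements of C */
    scatter(a, idx, v, 0xFFFF);    /* step 3: scatter the results back */
}
```

Only the index vector, the gathered values and the sum are live at any time, which mirrors the small register footprint noted above.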

Embedded Broadcasting

Consider a common matrix-matrix multiplication:
for(i=0;i<SIZE;i++)
 for(j=0;j<SIZE;j++)
  for(k=0;k<SIZE;k++)
    A[i*SIZE+k] += B[i*SIZE+j] * C[j*SIZE+k];
Written that way, the innermost loop actually performs an update of one row of A, where B[i*SIZE+j] is constant for the whole loop. With embedded broadcasting, the constant B[i*SIZE+j] is loaded once and broadcast to fill register ZMM0 (line 1) before the registers ZMM1 to ZMM4 are filled with 16 values of A each (lines 2-5, vmovups). This indicates an unroll factor of 4. Finally C is loaded, multiplied with B in ZMM0, added and stored to the ZMM registers 1-4 in a single fused multiply-add instruction (vfmadd213ps).
Listing 9: Corresponding assembly snippets for the matrix-matrix multiplication.
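The broadcast-and-FMA pattern can be sketched in scalar C. The code below is a model of the idea, not the compiler output: it assumes that the matrix dimension is a multiple of 16 and leaves out the unrolling by 4; the names fma_broadcast_block and matmul_rows are illustrative.

```c
#include <stddef.h>

#define LANES 16 /* 16 floats per 512-bit register */

/* One vector step of the innermost loop: the scalar b_ij is broadcast to
   all lanes, then a[l] = bcast[l] * c[l] + a[l] for 16 lanes at once. */
void fma_broadcast_block(float a[LANES], float b_ij, const float c[LANES])
{
    float bcast[LANES];
    for (int l = 0; l < LANES; l++)
        bcast[l] = b_ij;                 /* broadcast into a full register */
    for (int l = 0; l < LANES; l++)
        a[l] = bcast[l] * c[l] + a[l];   /* multiply and add per lane */
}

/* A += B * C for row-major size x size matrices.
   Assumption: size is a multiple of LANES. */
void matmul_rows(float *A, const float *B, const float *C, size_t size)
{
    for (size_t i = 0; i < size; i++) {
        for (size_t j = 0; j < size; j++) {
            float b_ij = B[i * size + j];  /* constant in the k loop */
            for (size_t k = 0; k < size; k += LANES)
                fma_broadcast_block(&A[i * size + k], b_ij, &C[j * size + k]);
        }
    }
}
```

Since b_ij does not change inside the k loop, the broadcast is needed only once per (i, j) pair, while A and C are streamed through in blocks of 16.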

The topics "Ternary Logic" and "Embedded Rounding" are not covered due to their very limited usability.

AVX-512DQ: Double and Quad Words

Instructions, available on Skylake CPUs only, to convert integers to floating-point numbers in 512-bit vector registers. Not available on Knights Landing.

AVX-512BW: Byte and Word Support

Instructions to support applications with byte- or word-size data elements. Not available on Knights Landing.

AVX-512VL

This feature lets most of the AVX-512 instructions also operate on 128-bit (XMM) and 256-bit (YMM) registers. It is important for applications where the amount of data parallelism cannot fill the 512-bit vector registers. Not available on Knights Landing.

AVX-512ER: Exponential and Reciprocal

This feature is available for Knights Landing only. It adds vector versions for base-2 exponential, reciprocal and reciprocal square root at high accuracy.

AVX-512PF: Prefetch for Gather/Scatter

This feature is available for Knights Landing only. It adds the possibility to prefetch up to 16 elements of scattered data in memory with a single instruction. It is especially important for code using indirect addressing and further enhances the gather/scatter capabilities of AVX-512F.