AVX-512 Subversions
The AVX-512 instruction set, introduced with Intel's Knights Landing and Skylake processors, is implemented in different flavors (see the cpuinfo command). This page gives some deeper insight into these flavors.
AVX-512CD: Conflict Detection
The conflict detection instruction is available on Skylake and Knights Landing processors. It operates on a vector of indices (memory offsets) and reports, for each vector lane, which of the preceding lanes contain the same value. This instruction allows the vectorization of codes with indirect memory accesses that may lead to write conflicts.
For older instruction sets the program in Listing 1 must be executed sequentially, i.e. one value of i after the other, because B[i] might have the same value for different values of i. With AVX-512CD, vectorization is possible in the steps shown in Listing 2. First a block of values from B is loaded into a vector register (vmovups, line 1 in block B1.2), and then a conflict detection instruction (vpconflictd, line 2) identifies entries with identical values. If there are no conflicts, the code jumps to block B1.6. Values that may be in conflict are handled in a loop in block B1.4. For non-conflicting values the corresponding values of C are gathered (vgatherdps, line 8) and the resulting vector A is scattered to memory (vscatterdps, line 10). This program is compiled with
irregularly indexed store was generated for the variable
LOOP WAS VECTORIZED
--- begin vector cost summary ---
scalar cost: 10
vector cost: 9.780
estimated potential speedup: 1.020
--- end vector cost summary ---
Thus for the program in Listing 1 the best that can be expected is a tiny speedup. In practice this feature becomes meaningful for larger loops, where conflict detection allows the vectorization of the complete loop, which would otherwise be executed sequentially.
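The conflict detection step can be illustrated without AVX-512 hardware. The following is a minimal sketch in portable C that emulates the per-lane result of vpconflictd and uses it to decide whether a block of 16 indirect updates is free of duplicates. The loop A[B[i]] += C[i] is only an assumed stand-in for Listing 1, and the names conflict_emul and indirect_add are hypothetical. Unlike the compiler-generated code described above, the sketch simply falls back to in-order updates for a block that contains conflicts.

```c
#include <assert.h>
#include <stdint.h>

#define LANES 16 /* number of 32-bit lanes in a 512-bit ZMM register */

/* Scalar emulation of vpconflictd: bit j of out[lane] is set when an
   earlier lane j (j < lane) holds the same index as this lane. */
void conflict_emul(const int32_t idx[LANES], uint32_t out[LANES])
{
    for (int lane = 0; lane < LANES; lane++) {
        out[lane] = 0;
        for (int j = 0; j < lane; j++)
            if (idx[j] == idx[lane])
                out[lane] |= (uint32_t)1 << j;
    }
}

/* Assumed stand-in for Listing 1: A[B[i]] += C[i] for i in [0, n). */
void indirect_add(float *A, const int32_t *B, const float *C, int n)
{
    assert(n >= 0);
    int i = 0;
    for (; i + LANES <= n; i += LANES) {
        uint32_t conf[LANES], any = 0;
        conflict_emul(B + i, conf);
        for (int l = 0; l < LANES; l++)
            any |= conf[l];
        if (any == 0) {
            /* No duplicates: gather, add and scatter are independent per
               lane, so vector hardware could process all lanes at once. */
            float tmp[LANES];
            for (int l = 0; l < LANES; l++)
                tmp[l] = A[B[i + l]] + C[i + l];
            for (int l = 0; l < LANES; l++)
                A[B[i + l]] = tmp[l];
        } else {
            /* Duplicates present: apply the updates one after the other. */
            for (int l = 0; l < LANES; l++)
                A[B[i + l]] += C[i + l];
        }
    }
    for (; i < n; i++) /* remainder shorter than one vector */
        A[B[i]] += C[i];
}
```

A block whose conflict words are all zero can be updated lane-parallel; any non-zero word names the earlier lanes that write to the same element.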
AVX-512F: Foundation
The AVX-512F instruction set is fundamental to the vectorization possibilities. It contains several features described below.
Masking
With this instruction set dedicated bitmask registers (k0 to k7) are available, whose purpose is the effective vectorization of loops with if-statements like the one in Listing 3. Compiled with the options given above, a potential speedup of up to 13 is predicted this time, indicating a significant performance boost. The effect of this instruction set is again best seen in the assembly output, an abbreviated version of which is given in Listing 4. First B is loaded into ZMM registers (lines 1 and 2), afterwards A (lines 3 and 4). The instruction vcmpps evaluates the if-statement into mask registers (k2 and k4), and their logical complements are formed (knotw in lines 7 and 8). The registers containing B are copied according to the complemented masks (lines 9 and 10) and added to A (vaddps). The other B values are multiplied with the corresponding values according to the mask registers k2 and k4 (vmulps, lines 13 and 14) before the registers containing the results are stored completely (vmovups). The loop is unrolled by a factor of 2, therefore all instructions appear twice.
Compress/Expand
AVX-512F adds compress and expand instructions to AVX2. vcompress is used in the following example:
Permutation or Shuffle
These instructions rearrange the elements of one or two source registers and write them to the destination register. They begin with vperm in the assembly code and may be used for a transposition, i.e.
A[i][j] = B[j][i]
Their efficiency is mediocre, and the example assembly code is too complicated to be presented here.
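The three lane-wise operations described so far (masking, compress and permutation) can be emulated in portable C to make their semantics concrete. The following is only a sketch under stated assumptions: the branchy loop body in masked_update is a hypothetical example and not Listing 3, all function names are made up for illustration, and each function models what one vector instruction does to the 16 float lanes of a ZMM register rather than how the compiler actually schedules it.

```c
#include <assert.h>
#include <stdint.h>

#define LANES 16 /* 32-bit lanes in a 512-bit ZMM register */

/* Compare like vcmpps: bit l of the result is set where a[l] > b[l]. */
uint16_t cmp_gt_mask(const float a[LANES], const float b[LANES])
{
    uint16_t k = 0;
    for (int l = 0; l < LANES; l++)
        if (a[l] > b[l])
            k |= (uint16_t)(1u << l);
    return k;
}

/* Hypothetical branchy loop body
       if (a[l] > b[l]) a[l] += b[l]; else a[l] *= b[l];
   executed as two masked operations instead of a branch. */
void masked_update(float a[LANES], const float b[LANES])
{
    uint16_t k = cmp_gt_mask(a, b);
    uint16_t not_k = (uint16_t)~k; /* complement, as knotw would produce */
    for (int l = 0; l < LANES; l++)
        if ((k >> l) & 1u)
            a[l] += b[l];
    for (int l = 0; l < LANES; l++)
        if ((not_k >> l) & 1u)
            a[l] *= b[l];
}

/* Compress: copy the lanes selected by k to the front of dst, keeping
   their order. The return value is a convenience of this sketch; it
   equals the number of set bits in k. */
int compress_emul(const float src[LANES], uint16_t k, float dst[LANES])
{
    int n = 0;
    for (int l = 0; l < LANES; l++)
        if ((k >> l) & 1u)
            dst[n++] = src[l];
    assert(n <= LANES);
    return n;
}

/* Permute like vpermps: dst[l] = src[idx[l]], using the low 4 index
   bits. dst must not alias src. */
void permute_emul(const float src[LANES], const int32_t idx[LANES],
                  float dst[LANES])
{
    for (int l = 0; l < LANES; l++)
        dst[l] = src[idx[l] & (LANES - 1)];
}

/* Index vector that transposes a 4x4 row-major block held in one register. */
void transpose4x4_indices(int32_t idx[LANES])
{
    for (int l = 0; l < LANES; l++)
        idx[l] = (l % 4) * 4 + l / 4;
}
```

Because the two masks in masked_update are complementary, every lane is modified by exactly one of the two masked operations, which is what makes the if-statement vectorizable.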
Gather/Scatter
To vectorize non-unit strided access to arrays, like in
for(i=0;i<SIZE/INC;i++) A[i*INC] += C[i];
AVX-512 provides a gather and a scatter operation. Whether they are used depends on whether INC is known at compile time and on the size of INC. For INC=2 the compiler prefers to work on A with a series of shuffle and expand operations (see Listing 7). A stride of 2 puts 2 words of A onto one cache line. Guided by a mask register, 8 floats are read from 4 cache lines into one ZMM register, which is then half filled. Two half-filled ZMM registers are combined (line 1, vpermt2ps), and the operands C are fetched from memory and added (line 2, vaddps). Half of the values are expanded into a new register (vexpandps to ZMM6) and the other half is permuted into another register (ZMM8); then both registers are stored to 4 cache lines each. It is important to notice the large number of registers required in that case. Neglecting the vector masks ZMM1 and ZMM0, 5 additional registers are needed (2, 4, 7, 6, and 8).
Listing 7: Shuffle and expand
1 vpermt2ps %zmm2, %zmm1, %zmm4
2 vaddps 32000(%rsp,%rax,4), %zmm4, %zmm7
3 vexpandps %zmm7, %zmm6{%k1}{z}
4 vpermps %zmm7, %zmm0, %zmm8

Listing 8: Gather/Scatter
1 vgatherdps (%rax,%zmm0,4), %zmm1{%k1}
2 vaddps 32000(%rsp,%rdx,4), %zmm1, %zmm3
3 vscatterdps %zmm3, (%rax,%zmm0,4){%k3}
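The gather, add and scatter sequence of Listing 8 can be mimicked in portable C. The following is a minimal sketch, assuming 16 float lanes per register; gather_emul, scatter_emul and strided_add are hypothetical names, and the mask handling for the loop remainder is one possible choice rather than what the compiler necessarily generates.

```c
#include <assert.h>
#include <stdint.h>

#define LANES 16 /* 32-bit lanes in a 512-bit ZMM register */

/* Gather like vgatherdps: load base[idx[l]] into lane l where bit l of
   the mask k is set; other lanes keep their previous content. */
void gather_emul(float dst[LANES], const float *base,
                 const int32_t idx[LANES], uint16_t k)
{
    for (int l = 0; l < LANES; l++)
        if ((k >> l) & 1u)
            dst[l] = base[idx[l]];
}

/* Scatter like vscatterdps: store lane l to base[idx[l]] where bit l of
   the mask k is set. */
void scatter_emul(float *base, const int32_t idx[LANES],
                  const float src[LANES], uint16_t k)
{
    for (int l = 0; l < LANES; l++)
        if ((k >> l) & 1u)
            base[idx[l]] = src[l];
}

/* A[i*inc] += C[i] for i in [0, n), one vector of LANES elements at a time. */
void strided_add(float *A, const float *C, int n, int inc)
{
    assert(n >= 0 && inc > 0);
    for (int i = 0; i < n; i += LANES) {
        int rem = n - i;
        /* Mask off the lanes that lie beyond the end of the loop. */
        uint16_t k = (rem >= LANES) ? 0xFFFFu : (uint16_t)((1u << rem) - 1u);
        int32_t idx[LANES];
        float v[LANES] = {0};
        for (int l = 0; l < LANES; l++)
            idx[l] = (i + l) * inc; /* element offsets i*inc, (i+1)*inc, ... */
        gather_emul(v, A, idx, k);
        for (int l = 0; l < LANES; l++)
            if ((k >> l) & 1u)
                v[l] += C[i + l]; /* unit-stride operand, as in vaddps */
        scatter_emul(A, idx, v, k);
    }
}
```

Only three vector values are live per iteration here (index vector, data vector and mask), in contrast to the shuffle-and-expand variant with its larger register demand.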
Embedded Broadcasting
Consider a common matrix-matrix multiplication:
for(i=0;i<SIZE;i++)
  for(j=0;j<SIZE;j++)
    for(k=0;k<SIZE;k++)
      A[i*SIZE+k] += B[i*SIZE+j] * C[j*SIZE+k];
Written that way, the innermost loop actually performs an update of one row of A, where B[i*SIZE+j] is constant for the whole loop. With the embedded broadcast, the constant B[i*SIZE+j] is loaded once and broadcast to fill register ZMM0 (line 1) before the registers ZMM1 to ZMM4 are filled with 16 values of A each (lines 2-5, vmovups). This indicates an unroll factor of 4. Finally C is loaded, multiplied with B in ZMM0, added to A and stored in the ZMM registers 1-4 in a single fused multiply-add instruction (vfmadd213ps).
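The broadcast pattern can be written out lane by lane in portable C. The following is a sketch, assuming 16 float lanes per register and a matrix size that is a multiple of 16; matmul_broadcast is a hypothetical name, the unrolling by 4 is omitted, and the multiply-add is written as plain C, so whether it is fused depends on the compiler.

```c
#include <assert.h>

#define LANES 16 /* 32-bit lanes in a 512-bit ZMM register */

/* A += B * C for square row-major matrices of dimension size, using the
   same loop order as the example above. */
void matmul_broadcast(float *A, const float *B, const float *C, int size)
{
    assert(size > 0 && size % LANES == 0);
    for (int i = 0; i < size; i++) {
        for (int j = 0; j < size; j++) {
            /* Broadcast: one scalar of B fills all lanes of a register. */
            float b[LANES];
            for (int l = 0; l < LANES; l++)
                b[l] = B[i * size + j];
            /* Update one row of A, LANES elements per step. */
            for (int k = 0; k < size; k += LANES)
                for (int l = 0; l < LANES; l++)
                    A[i * size + k + l] += b[l] * C[j * size + k + l];
        }
    }
}
```

The broadcast happens once per (i, j) pair and is reused for the whole row update, which is why it costs little compared with the multiply-add work in the inner loop.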
The topics "Ternary Logic" and "Embedded Rounding" are not covered due to their very limited usability.