AVX-512 Subversions
The AVX-512 instruction set, introduced with Intel's Knights Landing and Skylake processors, is implemented in different flavors (see the cpuinfo command). This page gives some deeper insight into these flavors.
AVX-512CD: Conflict Detection
The conflict detection instruction is available on Skylake and Knights Landing processors. It operates on a vector of memory addresses and returns the set of vector lanes that contain elements equal to any other elements in this vector. This instruction allows the vectorization of code with indirect memory accesses that may lead to write conflicts.
With older instruction sets, the program in Listing 1 must be executed sequentially, i.e. one value of i after the other, because B[i] might have the same value for different values of i. With AVX-512CD, vectorization is possible in the steps specified in Listing 2.
The vectorization report reads:
irregularly indexed store was generated for the variable
LOOP WAS VECTORIZED
--- begin vector cost summary ---
scalar cost: 10
vector cost: 9.780
estimated potential speedup: 1.020
--- end vector cost summary ---
Thus, for the program in Listing 1, the best that can be expected is a tiny speedup. In practice this feature pays off for larger loops, where conflict detection allows the vectorization of a complete loop that would otherwise be executed sequentially.
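To make the idea concrete, the following is a minimal scalar sketch in C. It is not the document's Listing 1 or Listing 2. It assumes an update of the form A[B[i]] += C[i], a width of 16 lanes, and the hypothetical helper names conflict_emul and indirect_add_block. The first function emulates what a conflict-detection step reports: for each lane, a bitmask of the earlier lanes holding the same index. The second processes one block in rounds, so that no two lanes of a round write the same element.

```c
#include <stdint.h>

#define LANES 16  /* assumption: sixteen 32-bit lanes per 512-bit register */

/* Scalar emulation of a conflict-detection step: bit j of out[l] is set
 * when an earlier lane j < l holds the same index as lane l. */
void conflict_emul(const int32_t idx[LANES], uint16_t out[LANES])
{
    for (int l = 0; l < LANES; l++) {
        out[l] = 0;
        for (int j = 0; j < l; j++)
            if (idx[j] == idx[l])
                out[l] |= (uint16_t)(1u << j);
    }
}

/* A[B[l]] += C[l] for one block of LANES elements, processed in rounds.
 * A lane becomes ready once all earlier lanes with the same index are
 * done, so the lanes of one round never write the same element. */
void indirect_add_block(float *A, const int32_t *B, const float *C)
{
    uint16_t conflicts[LANES];
    uint16_t done = 0;

    conflict_emul(B, conflicts);
    while (done != 0xFFFFu) {
        uint16_t ready = 0;
        for (int l = 0; l < LANES; l++)
            if (!(done & (1u << l)) && (conflicts[l] & ~done) == 0)
                ready |= (uint16_t)(1u << l);
        /* In vector code this round would be one masked
         * gather, add and scatter over the ready lanes. */
        for (int l = 0; l < LANES; l++)
            if (ready & (1u << l))
                A[B[l]] += C[l];
        done |= ready;
    }
}
```

If all indices in a block are distinct, the loop finishes in a single round, which is the case where vectorization pays off most. With many duplicates the number of rounds grows, which is consistent with the small estimated speedup reported above.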
AVX-512F: Foundation
The AVX-512F instruction set is the foundation for the vectorization capabilities. It contains several features, described below.
Masking
With this instruction set, dedicated bitmask registers are available, whose purpose is the efficient vectorization of loops with if-statements like the one in Listing 3.
Compress/Expand
AVX-512F adds compress and expand instructions to AVX2. vcompress is used in the following example:
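Compress builds on the mask registers described under Masking. As a minimal sketch, the scalar C code below emulates the semantics on 16 lanes. The function names, the condition x > 0 and the array names are assumptions, not taken from the document's listings. It first builds a bitmask from a comparison, then uses that mask once for a masked update and once to pack the selected elements contiguously, which is what a compress instruction does.

```c
#include <stdint.h>

#define LANES 16  /* assumption: sixteen floats per 512-bit register */

/* Build a 16-bit mask: bit l is set when x[l] > 0 (assumed condition). */
uint16_t mask_gt_zero(const float x[LANES])
{
    uint16_t k = 0;
    for (int l = 0; l < LANES; l++)
        if (x[l] > 0.0f)
            k |= (uint16_t)(1u << l);
    return k;
}

/* Masked add: a[l] += b[l] only in lanes whose mask bit is set;
 * the other lanes of a are left unchanged. */
void masked_add(float a[LANES], const float b[LANES], uint16_t k)
{
    for (int l = 0; l < LANES; l++)
        if (k & (1u << l))
            a[l] += b[l];
}

/* Compress: copy the selected lanes of src to dst, packed contiguously
 * and in lane order. Returns the number of elements written. */
int compress(float *dst, const float src[LANES], uint16_t k)
{
    int n = 0;
    for (int l = 0; l < LANES; l++)
        if (k & (1u << l))
            dst[n++] = src[l];
    return n;
}
```

Expand is the inverse operation: it reads a contiguous run of elements and distributes them to the lanes whose mask bit is set.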
Permutation or Shuffle
These instructions rearrange the elements of one or two source registers and write them to the destination register. They begin with vperm in the assembly code and may be used for a transposition, i.e.
A[i][j] = B[j][i]
Their efficiency is mediocre, and the example assembly code is too complicated to be presented here.
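The principle can still be shown without the assembly. The following is a minimal scalar sketch in C of a single-source permutation, dst[l] = src[idx[l]], applied to a 4x4 block held in one 16-lane register. The function names and the 4x4 block size are assumptions chosen for illustration; the code emulates the semantics and is not compiler output.

```c
#define LANES 16  /* assumption: sixteen floats per 512-bit register */

/* Emulates a single-source permute: lane l of dst receives lane idx[l]
 * of src. The bitwise AND keeps every index within the register. */
void permute_emul(float dst[LANES], const float src[LANES],
                  const int idx[LANES])
{
    for (int l = 0; l < LANES; l++)
        dst[l] = src[idx[l] & (LANES - 1)];
}

/* Index vector that transposes a row-major 4x4 block: element (r, c)
 * of the result is taken from element (c, r) of the source. */
void transpose4x4_index(int idx[LANES])
{
    for (int r = 0; r < 4; r++)
        for (int c = 0; c < 4; c++)
            idx[r * 4 + c] = c * 4 + r;
}
```

For matrices that do not fit into a single register, a transposition needs several such permutations that combine elements from different registers, which is one reason the generated code becomes long.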
Gather/Scatter
To vectorize non-unit strided access to arrays, as in
for(i=0;i<SIZE/INC;i++) A[i*INC] += C[i];
AVX-512 provides a gather and a scatter operation. Whether they are used depends on whether INC is known at compile time and on the size of INC. For INC=2 the compiler prefers to work on A with a series of shuffle and expand operations (see Listing 7). A stride of 2 puts 2 words of A onto one cache line. Guided by a mask register, 8 floats are read from 4 cache lines into one ZMM register, which is then half filled. Two half-filled ZMM registers are combined (line 1, vpermt2ps), and the operands C are fetched from memory and added (line 2, vaddps). Half of the values are expanded into a new register (vexpandps to ZMM6) and the other half are permuted into another register (ZMM8); both registers are then stored to 4 cache lines each. Note the large number of registers required in this case: neglecting the vector masks ZMM1 and ZMM0, 5 additional registers are needed (2, 4, 7, 6, and 8).
1 vpermt2ps %zmm2, %zmm1, %zmm4
2 vaddps 32000(%rsp,%rax,4), %zmm4, %zmm7
3 vexpandps %zmm7, %zmm6{%k1}{z}
4 vpermps %zmm7, %zmm0, %zmm8
Listing 7: Shuffle and expand

1 vgatherdps (%rax,%zmm0,4), %zmm1{%k1}
2 vaddps 32000(%rsp,%rdx,4), %zmm1, %zmm3
3 vscatterdps %zmm3, (%rax,%zmm0,4){%k3}
Listing 8: Gather/Scatter
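The gather/scatter variant of the strided loop can be read as the following scalar sketch in C. It is an emulation under stated assumptions: 16 lanes per register, hypothetical helper names gather_emul, scatter_emul and strided_add, and a mask that switches off the lanes beyond the end of the loop. It mirrors the three steps of gather, add and scatter, but is not the code the compiler generates.

```c
#include <stdint.h>

#define LANES 16  /* assumption: sixteen floats per 512-bit register */

/* Gather: dst[l] = base[idx[l]] for every lane whose mask bit is set. */
void gather_emul(float dst[LANES], const float *base,
                 const int32_t idx[LANES], uint16_t k)
{
    for (int l = 0; l < LANES; l++)
        if (k & (1u << l))
            dst[l] = base[idx[l]];
}

/* Scatter: base[idx[l]] = src[l] for every lane whose mask bit is set. */
void scatter_emul(float *base, const int32_t idx[LANES],
                  const float src[LANES], uint16_t k)
{
    for (int l = 0; l < LANES; l++)
        if (k & (1u << l))
            base[idx[l]] = src[l];
}

/* A[i*inc] += C[i] for i in [0, n), processed in blocks of LANES. */
void strided_add(float *A, const float *C, int n, int inc)
{
    for (int i0 = 0; i0 < n; i0 += LANES) {
        int32_t idx[LANES];
        float a[LANES];
        uint16_t k = 0;

        /* Index vector and mask; lanes past the end stay masked off. */
        for (int l = 0; l < LANES; l++) {
            idx[l] = (int32_t)((i0 + l) * inc);
            if (i0 + l < n)
                k |= (uint16_t)(1u << l);
        }
        gather_emul(a, A, idx, k);          /* gather strided A   */
        for (int l = 0; l < LANES; l++)     /* add contiguous C   */
            if (k & (1u << l))
                a[l] += C[i0 + l];
        scatter_emul(A, idx, a, k);         /* scatter back to A  */
    }
}
```

For the loop in the text, n corresponds to SIZE/INC and inc to INC. Because the indices are held in a register, this form works for any stride, including one that is only known at run time.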
Embedded Broadcasting
Consider a common matrix-matrix multiplication:
for(i=0;i<SIZE;i++)
  for(j=0;j<SIZE;j++)
    for(k=0;k<SIZE;k++)
      A[i*SIZE+k] += B[i*SIZE+j] * C[j*SIZE+k];
Written that way, the innermost loop actually performs an update of one row of A, where B[i*SIZE+j] is constant for the whole loop. With embedded broadcasting, the constant B[i*SIZE+j] is loaded once and broadcast to fill register ZMM0 (line 1) before the registers ZMM1 to ZMM4 are filled with 16 values of A each (lines 2-5, vmovups). This indicates an unroll factor of 4. Finally C is loaded, multiplied with B in ZMM0, added, and the result stored to ZMM registers 1-4, each in a single fused multiply-add instruction (vfmadd213ps).
The topics "Ternary Logic" and "Embedded Rounding" are not covered due to their very limited usability.
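The role of the broadcast value can be written out explicitly in scalar C. This is a minimal sketch, not the generated code: the function name matmul_bcast is hypothetical, and the local variable b merely stands in for the broadcast register, holding the one element of B that stays constant throughout the innermost loop.

```c
/* A += B * C for square, row-major matrices of dimension size.
 * b plays the role of the broadcast value: it is loaded once per
 * (i, j) pair and reused for every k of the innermost loop. */
void matmul_bcast(float *A, const float *B, const float *C, int size)
{
    for (int i = 0; i < size; i++)
        for (int j = 0; j < size; j++) {
            const float b = B[i * size + j];   /* loaded once */
            for (int k = 0; k < size; k++)     /* unit stride in A and C */
                A[i * size + k] += b * C[j * size + k];
        }
}
```

Since A and C are both accessed with unit stride in k, the innermost loop maps directly onto vector loads, a fused multiply-add and vector stores, with b supplying the same factor to every lane.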