Guidelines for Vectorization
The goal of including the vectorizer component in the compiler is to exploit single-instruction, multiple-data (SIMD) processing automatically. Users can help by supplying the compiler with additional information, for example:
- by using compiler options for optimization,
- by using auto-vectorizer hints or pragmas.
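As an illustration of such hints, the sketch below combines the C99 `restrict` qualifier with the OpenMP `#pragma omp simd` directive. The function name `saxpy` is chosen here for illustration only. Whether the pragma is honored depends on the compiler and on OpenMP SIMD support being enabled, so this is one possible way to pass extra information, not a guaranteed recipe.

```c
#include <stddef.h>

/* restrict promises the compiler that x and y never overlap,
   which removes an assumed dependency between loads and stores. */
void saxpy(size_t n, float a, const float *restrict x, float *restrict y)
{
    /* Hint that the iterations are independent and may be executed in
       SIMD lanes. Compilers without OpenMP SIMD support ignore the pragma. */
    #pragma omp simd
    for (size_t i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}
```

Without the hints the loop is still correct; the compiler simply has to prove independence on its own or fall back to scalar code.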
First of all, follow these basic guidelines to vectorize innermost loop bodies:
Use:
- only assignment statements
- vector statements, that is, assignments with arrays.
Avoid:
- function calls, except math library calls
- mixing vectorizable types in the same loop, which degrades performance
- data-dependent loop exit conditions
- non-vectorizable operations.
Writing Vectorizable Code
Use simple for loops. Avoid complex loop termination conditions: the upper iteration limit must be invariant within the loop. For the innermost loop in a nest of loops, the upper iteration limit may be a function of the outer loop indices.
Good | Poor |
for(i=0;i<N;i++) | while(i) |
for(i=0;i<N;i++) for(j=0;j<i;j++) | for(;;) { if(condition) break; } |
Write straight-line code. Avoid branches such as switch, goto, or return statements within loops.
Good | Poor |
if(condition) { for(i=0;i<N;i++) code_condition_1 } else { for(i=0;i<N;i++) code_condition_2 } | for(i=0;i<N;i++) if(condition) { code_condition_1 } else { code_condition_2 } |
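The two variants can be written out as compilable C. This is a minimal sketch: the function names and the concrete loop bodies (scaling versus shifting) are placeholders standing in for code_condition_1 and code_condition_2, and the condition is assumed to be invariant within the loop.

```c
#include <stddef.h>

/* Branch inside the loop: the condition is tested on every iteration. */
void scale_or_shift_branchy(size_t n, float *a, int use_scale)
{
    for (size_t i = 0; i < n; i++) {
        if (use_scale)
            a[i] = a[i] * 2.0f;
        else
            a[i] = a[i] + 1.0f;
    }
}

/* Branch hoisted out of the loop: each loop body is straight-line code. */
void scale_or_shift_hoisted(size_t n, float *a, int use_scale)
{
    if (use_scale) {
        for (size_t i = 0; i < n; i++)
            a[i] = a[i] * 2.0f;
    } else {
        for (size_t i = 0; i < n; i++)
            a[i] = a[i] + 1.0f;
    }
}
```

Both functions compute the same result; the hoisted form only duplicates the loop so that neither copy contains a branch.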
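Avoid dependencies between loop iterations, in particular read-after-write dependencies, where an iteration reads a value written by an earlier one.
Good | Poor |
 | for(i=1;i<N;i++) A[i]=A[i-1] + .. |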
Prefer a single contiguous array with explicit index arithmetic over an array of pointers.
Good | Poor |
float *mat; for(i=1;i<rows;i++) for(j=0;j<cols;j++) mat[i*cols+j]=mat[(i-1)*cols+j] + .. | float **mat; for(i=1;i<rows;i++) for(j=0;j<cols;j++) mat[i][j]=mat[i-1][j] + .. |
With float **mat, the compiler does not know whether mat[i] and mat[i-1] point to the same memory.
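A compilable version of the contiguous layout might look as follows. This is a sketch under assumptions: the function name `accumulate_rows` is hypothetical, and the elided right-hand side (`+ ..`) is taken to be the current element, so each row has its predecessor added to it.

```c
#include <stddef.h>

/* Adds the previous row to each row of a rows x cols matrix stored as one
   contiguous block; element (i, j) lives at mat[i * cols + j]. */
void accumulate_rows(float *mat, size_t rows, size_t cols)
{
    /* Start at row 1: row 0 has no predecessor. */
    for (size_t i = 1; i < rows; i++)
        /* Unit-stride inner loop over one row. */
        for (size_t j = 0; j < cols; j++)
            mat[i * cols + j] = mat[(i - 1) * cols + j] + mat[i * cols + j];
}
```

Because both rows are addressed relative to the same base pointer, the compiler can see that row i and row i-1 are cols elements apart and do not overlap within the inner loop.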
Avoid data-dependent loop exit conditions.
Good | Poor |
 | for(i=0;x[i]!=0.;) { x=a[i]; i++; } |
Access memory with unit stride where possible, and avoid strided or indirect addressing.
Good | Poor |
for(i=0;i<N-1;i+=2) { x[i]=1.0f; x[i+1]=-1.0f; } | for(i=0;i<N;i+=2) x[i]=1.; for(i=1;i<N;i+=2) x[i]=-1.; for(i=0;i<N;i++) y[i]=x[ind[i]]; |
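The single-pass variant can be sketched as a small function. The name `fill_alternating` is hypothetical, and the handling of an odd element count is an addition that the table entry does not cover.

```c
#include <stddef.h>

/* Writes +1, -1, +1, ... into x[0..n-1] in a single pass over memory. */
void fill_alternating(float *x, size_t n)
{
    size_t i;
    /* Two adjacent elements per iteration, so memory is touched in order. */
    for (i = 0; i + 1 < n; i += 2) {
        x[i]     = 1.0f;
        x[i + 1] = -1.0f;
    }
    /* Odd n: one element is left over. */
    if (i < n)
        x[i] = 1.0f;
}
```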
Prefer a structure of arrays (SoA) over an array of structures (AoS).
Good | Poor |
struct SOA { double x[MAX],y[MAX]; char c[MAX]; }; struct SOA soa; for(i=0;i<MAX;i++) soa.x[i]=0.0; Stride 1 access in memory. | struct AOS { double x,y; char c; }; struct AOS aos[MAX]; for(i=0;i<N;i++) aos[i].x=0.0; Stride 3 access in memory. |
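The difference in stride can be made explicit in code. In this sketch the value of `MAX` and the function names are assumptions for illustration. The helper `aos_stride_in_doubles` reports how far apart consecutive x fields are; on common 64-bit ABIs the struct is padded to 24 bytes, which gives the stride of 3 mentioned above, but the exact value is implementation-defined.

```c
#include <stddef.h>

#define MAX 1024

/* Structure of arrays: all x values are adjacent in memory. */
struct SOA { double x[MAX], y[MAX]; char c[MAX]; };

/* Array of structures: consecutive x values are one whole struct apart. */
struct AOS { double x, y; char c; };

void clear_x_soa(struct SOA *s, size_t n)
{
    for (size_t i = 0; i < n; i++)
        s->x[i] = 0.0;              /* stride of one double */
}

void clear_x_aos(struct AOS *a, size_t n)
{
    for (size_t i = 0; i < n; i++)
        a[i].x = 0.0;               /* stride of sizeof(struct AOS) bytes */
}

/* Distance between consecutive x fields of an AOS array, in doubles. */
size_t aos_stride_in_doubles(void)
{
    return sizeof(struct AOS) / sizeof(double);
}
```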
Using Aligned Data Structures
Data structure alignment is the adjustment of any data object with relation to other objects. The compiler may align individual variables to start at certain addresses to speed up memory access. Misaligned memory accesses can incur large performance losses on certain target processors that do not support them in hardware.
Alignment is a property of a memory address, expressed as the numeric address modulo a power of two. In addition to its address, a single datum also has a size. A datum is called 'naturally aligned' if its address is aligned to its size; otherwise it is called 'misaligned'. For example, an 8-byte floating-point datum is naturally aligned if the address used to identify it is aligned to eight (8), that is, if the address modulo 8 is 0.
A data structure is a way of storing data in a computer so that it can be used efficiently. Often, a carefully chosen data structure allows a more efficient algorithm to be used.
In the case of vector registers, the register size determines the required alignment. Thus an aligned AVX-512 vector operation requires its operands to start at a 64-byte boundary.
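The modulo definition of alignment translates directly into a small check. The helper name `is_aligned` is hypothetical; the sketch assumes that converting a pointer to `uintptr_t` yields its numeric address, which holds on common platforms.

```c
#include <stddef.h>
#include <stdint.h>

/* Returns 1 if the address p is a multiple of alignment, 0 otherwise.
   alignment is expected to be a power of two, e.g. 8 or 64. */
int is_aligned(const void *p, size_t alignment)
{
    return ((uintptr_t)p % alignment) == 0;
}
```

For example, `is_aligned(ptr, 64)` tests whether `ptr` could serve as the start of an aligned 64-byte vector access.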
What happens if an array is misaligned? Consider the following example:
float a[N];
for(i=0;i<N;i++)
a[i]=1.;
The natural alignment of array a is four (4), the size of a float; that is, the address of a[0] modulo 4 is 0, but the address modulo 64 is not necessarily 0.
The compiler may vectorize this loop, but an aligned load or store with the AVX-512 vector register zmm cannot start at such an address, only at a 64-byte boundary.
Therefore the compiler splits this loop into at least two pieces, as in the following pseudo code:
for(i=0; i<N && ((uintptr_t)&a[i]%64!=0); i++)
  a[i]=1.;
i_al=i;
for(i=i_al;i<N;i++)
  a[i]=1.;
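The split can be reproduced by hand to see how many elements end up in the first piece. This is only a sketch of the idea, not the code the compiler actually generates; the function name `fill_with_peel` and its return value are introduced for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Fills a[0..n-1] with 1.0f in two phases and returns the number of
   elements handled by the scalar peel loop. */
size_t fill_with_peel(float *a, size_t n)
{
    size_t i = 0;

    /* Peel loop: advance element by element until &a[i] is 64-byte aligned. */
    while (i < n && ((uintptr_t)&a[i] % 64) != 0) {
        a[i] = 1.0f;
        i++;
    }
    size_t peeled = i;

    /* Main loop: starts at a 64-byte boundary, or is empty. */
    for (; i < n; i++)
        a[i] = 1.0f;

    return peeled;
}
```

If the array starts 12 bytes past a 64-byte boundary, the peel loop handles 13 floats (52 bytes) before the main loop begins; if the array is already aligned, the peel loop does nothing.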
Using the Intel compiler, you may verify this from the following output in the optimization report:
LOOP BEGIN at source.c(16,1)
<Peeled loop for vectorization>
remark #15301: PEEL LOOP WAS VECTORIZED
LOOP END
LOOP BEGIN at source.c(16,1)
remark #15300: LOOP WAS VECTORIZED
LOOP END
LOOP BEGIN at alignment.c(16,1)
<Remainder loop for vectorization>
remark #15301: REMAINDER LOOP WAS VECTORIZED
LOOP END
The first thing to notice is that there are three different reports for the same loop starting at line 16, position 1. The first refers to the extra loop that handles the misaligned data, as described above. The second refers to the loop itself, starting at an aligned address. Vector registers work best if completely filled, but the loop length is not necessarily a multiple of the vector register length. For better performance the compiler therefore generates a third loop in which the remaining elements are processed. If we are able to align the arrays used within a loop, the first loop will never be used, thus improving performance (see alignment).
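One way to obtain an array that starts at a 64-byte boundary is the C11 function `aligned_alloc`. The sketch below is an assumption about how this could be done, not the method prescribed by this guide; the wrapper name `alloc_floats_aligned64` is hypothetical, and the compiler may still need to be told about the alignment before it drops the peel loop.

```c
#include <stddef.h>
#include <stdlib.h>

/* Allocates room for n floats starting at a 64-byte boundary.
   Returns NULL on failure; release the memory with free(). */
float *alloc_floats_aligned64(size_t n)
{
    size_t bytes = n * sizeof(float);

    /* C11 aligned_alloc expects the size to be a multiple of the alignment,
       so round up to the next multiple of 64 (at least 64). */
    size_t rounded = (bytes + 63) / 64 * 64;
    if (rounded == 0)
        rounded = 64;

    return aligned_alloc(64, rounded);
}
```

For arrays with static or automatic storage, the C11 specifier `_Alignas(64)` on the declaration serves the same purpose.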