Guidelines for Vectorization

The goal of including the vectorizer component in the Compiler is to exploit single-instruction multiple data (SIMD) processing automatically. Users can help by supplying the compiler with additional information; for example,

by using compiler options for optimization,
by using auto-vectorizer hints or pragmas.

First of all follow these very basic guidelines to vectorize innermost loop bodies:

Use:

only assignments statements
vector statements, that is assignments with arrays.

Avoid:

function calls except math library calls
mixing vectorizable types in the same loop to avoid degraded performance
data-dependent loop exit conditions
non vectorizable operations.

Sometimes you have to need to introduce changes to loops. You should thereby avoid to perform loop unrolling, which the compiler performs automatically, and decomposing one loop with several statements into several loops with a single statement.

Writing Vectorizable Code

Use simple for loops. Avoid complex loop termination conditions – the upper iteration limit must be invariant within the loop. For the innermost loop in a nest of loops, you could set the upper limit iteration to be a function of the outer loop indices.

Good	Poor
for(i=0;i<N;i++)	while(i)
for(i=0;i<N;i++) for(j=0;j<i;j++)	for(;;) { if(condition) break; }

Write straight-line code. Avoid branches such as switch, goto, or return statements within loops.

Good	Poor
if(condition) { for(i=0;i<N;i++) code_condition_1 } else { for(i=0;i<N;i++) code_condition_2 }	for(i=0;i<N;i++) if(condition) { code_condition_1 } else { code_condition_2 }

Avoid dependencies between loop iterations that prevent vectorization.

Good	Poor
	for(i=0;i<N;i++) A[i]=A[i-1] + ..

Try to use array notations instead of the use of pointers. C programs in particular impose very few restrictions on the use of pointers; aliased pointers may lead to unexpected dependencies. Without help, the compiler often cannot tell whether it is safe to vectorize code containing pointers.

Good	Poor
float mat; for(i=0;i<rows;i++) for(j=0;j<cols;j++) mat[(i)cols+j]=mat[(i-1)*cols+j] + ..	float **mat; for(i=0;i<rows;i++) for(j=0;j<cols;j++) mat[i][j]=mat[i-1][j] + .. compiler does not know whether mat[i] and mat[i-1] are the same

Wherever possible, use the loop index directly in array subscripts instead of incrementing a separate counter for use as an array address.

Good	Poor
	for(i=0;x[i]!=0.;) { x=a[i]; i++; }

Access memory efficiently, that is, that is, favor inner loops with unit stride, minimize indirect addressing and align your data to 16-byte boundaries (see below).

Good	Poor
for(i=0;i<N-1;i+=2) { x[i]=1.0f0; x[i+1]=-1.0f0; }	for(i=0;i<N;i+=2) x[i]=1.; for(i=1;i<N;i+=2) x[i]=-1.; for(i=0;i<N;i++) y[i]=x[ind[i]];

Use structure of arrays (SoA) instead of array of structures (AoS): An array is the most common type of data structure that contains a contiguous collection of data items that can be accessed by an ordinal index. You can organize this data as an array of structures (AoS) or as a structure of arrays (SoA). While AoS organization is excellent for encapsulation it can be a hindrance for use of vector processing.

Good	Poor
struct SOA { double x[MAX],y[MAX]; char c[MAX];} SOA soa; for(i=0;i<MAX;i+) SOA.x[i]=0.0f0; Stride 1 access in memory.	struct AOS { double x,y; char c;} AOS aos[MAX]; for(i=0;i<N;i++) AOS[i].x=0.0f0; Stride 3 access in memory.

Using Aligned Data Structures

Data structure alignment is the adjustment of any data object with relation to other objects. The compiler may align individual variables to start at certain addresses to speed up memory access. Misaligned memory accesses can incur large performance losses on certain target processors that do not support them in hardware.

Alignment is a property of a memory address, expressed as the numeric address modulo of powers of two. In addition to its address, a single datum also has a size. A datum is called 'naturally aligned' if its address is aligned to its size, otherwise it is called 'misaligned'. For example, an 8-byte floating-point datum is naturally aligned if the address used to identify it is aligned to eight (8).

A data structure is a way of storing data in a computer so that it can be used efficiently. Often, a carefully chosen data structure allows a more efficient algorithm to be used.

In the case of vector registers, their size refers to the required alignment. Thus a AVX 512 vector operations requires the operands to start at a 64 Byte alignment.

What happens if an array is misaligned? We consider the following example:
float a[N]; for(i=0;i<N;i++) a[i]=1.; The natural alignment of array a is eight (8) that is, the address of a[0] modulo 8 is 0 but not modulo 64. The compiler may vectorize this loop, but the vector register zmm for AVX 512 can not start to load memory at that address but only at a 64 Byte alignment. Therefore the compiler splits this loop at least into 2 pieces according to (pseudo code)
for(i=0,i_al=0; (&a[i]%64!=0);i++) { a[i]=1.; i_al=i+1;} for(i=i_al;i<N;i++) a[i]=1.; Using the Intel compiler you may verify this from the following output in the optimization report:

LOOP BEGIN at source.c(16,1)
<Peeled loop for vectorization>
   remark #15301: PEEL LOOP WAS VECTORIZED
LOOP END

LOOP BEGIN at source.c(16,1)
   remark #15300: LOOP WAS VECTORIZED
LOOP END

LOOP BEGIN at alignment.c(16,1)
<Remainder loop for vectorization>
   remark #15301: REMAINDER LOOP WAS VECTORIZED
LOOP END

First thing to notice is that there exists three different reports for the same loop starting at line 16 at position 1. The first refers to the extra loop to tackle misaligned data as specified above. The second refers to the loop itself starting at an aligned adress. Vector registers work best if completely filled but the loop length is not necessarily a multiply of the vector register length. For better performance the compiler therefore generates a third loop there the remaining elements are treated. If we are able to align arrays within a loop the first loop will never be used thus improving performance (s. alignment).

Main Navigation

Contents

Guidelines for Vectorization

Writing Vectorizable Code

Using Aligned Data Structures