Autovectorization

Vectorization is the process of converting an algorithm from a scalar implementation, which performs an operation on one pair of operands at a time, to a vector implementation, where a single instruction operates on a vector (a series of adjacent values). This process is referred to as auto-vectorization to emphasize that the compiler identifies and optimizes suitable loops on its own, without requiring any special action by you. In some cases, however, certain keywords or directives must be added to the code before auto-vectorization can occur. Consider the following sample code fragment, where a, b and c are integer arrays:
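A minimal sketch of such a fragment, assuming an element-wise addition of the two input arrays; the function name add_arrays and the length parameter n are hypothetical and chosen only for illustration:

```c
#include <stddef.h>

/* Element-wise sum of two integer arrays. Each iteration is independent
   of the others, so a compiler may handle several elements with one
   vector instruction. The name add_arrays and the parameter n are
   illustrative assumptions. */
void add_arrays(int *c, const int *a, const int *b, size_t n)
{
    for (size_t i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}
```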

If vectorization is not enabled (that is, you compile using O1 or with no optimization at all), the compiler processes one element per iteration, which leaves most of each vector register unused. If vectorization is enabled (compiled using O2 or higher options), the compiler may use the full width of the registers to perform several additions in a single instruction. The compiler looks for vectorization opportunities whenever you compile at O2 or higher; the exact level at which vectorization is switched on, and whether that level is the default, depends on the compiler and its version.

To find out whether a loop was vectorized, enable the compiler's optimization report; the option for this is specific to the compiler used. The output may look like
LOOP BEGIN at simple01.c(10,1)
  remark #15300: LOOP WAS VECTORIZED
LOOP END
using Intel's compiler, or
simple01.c:10:1: note: loop vectorized
using the GNU compiler. In both cases vectorization of the loop in line 10 was successful.

Obstacles

The following may either prevent vectorization or cause the compiler to decide that vectorization would not be worthwhile.

Function calls

Function calls within a loop generally prevent vectorization. Even a print statement is sufficient to prevent a loop from being vectorized. The vectorization report message is typically: non-standard loop is not a vectorization candidate. The two major exceptions are intrinsic math functions and functions that can be inlined. Intrinsic math functions such as sin(), log(), fmax(), and so on, are allowed because the compiler runtime library contains vectorized versions of these functions.
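The contrast between an inlinable call and a call with side effects can be sketched as follows. The function names are hypothetical, and whether a given compiler actually vectorizes the first loop depends on its inlining decisions and options, so the optimization report remains the authority:

```c
#include <stddef.h>
#include <stdio.h>

/* Small helper defined in the same file: a candidate for inlining,
   after which the loop in fill_square contains no call at all. */
static inline double square(double v)
{
    return v * v;
}

/* Loop whose only call can be inlined. */
void fill_square(double *y, const double *x, size_t n)
{
    for (size_t i = 0; i < n; i++)
        y[i] = square(x[i]);
}

/* Loop with a print statement: the output must appear in iteration
   order, so this loop is typically left as scalar code. */
void fill_and_print(double *y, const double *x, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        y[i] = 2.0 * x[i];
        printf("%zu %f\n", i, y[i]);
    }
}
```

Moving the print statement into a separate loop after the computation is one way to keep the arithmetic loop free of calls.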

Non-contiguous memory access

Several consecutive integers, floating-point values, or doubles may be loaded directly from memory into the vector registers in a single instruction. But if the corresponding elements are not adjacent, they must be loaded separately using multiple instructions, which is considerably less efficient. The most common examples of non-contiguous memory access are loops with non-unit stride or with indirect addressing, as in the examples below. The compiler rarely vectorizes such loops, unless the amount of computational work is large compared to the overhead from non-contiguous memory access.
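The two access patterns can be sketched as below, next to a unit-stride loop for comparison. All function and parameter names are hypothetical; the loops only illustrate the memory access pattern and make no claim about what a particular compiler reports for them:

```c
#include <stddef.h>

/* Unit stride: neighboring elements are read and written, so whole
   vectors can be loaded and stored at once. */
void scale_contiguous(double *b, const double *a, size_t n)
{
    for (size_t i = 0; i < n; i++)
        b[i] = 2.0 * a[i];
}

/* Non-unit stride: only every second element is touched. */
void scale_strided(double *b, const double *a, size_t n)
{
    for (size_t i = 0; i < n; i += 2)
        b[i] = 2.0 * a[i];
}

/* Indirect addressing: which element of x is read depends on the
   contents of the index array idx. */
void gather_add(double *b, const double *x, const size_t *idx, size_t n)
{
    for (size_t i = 0; i < n; i++)
        b[i] += x[idx[i]];
}
```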

Depending on the compiler, or even the version of the compiler, the report for such loops may state that the loop was vectorized, that it was not vectorized, or that the nested loops were permuted.

Data dependencies

Vectorization entails changes in the order of operations within a loop, since each vector instruction operates on several data elements at once. Vectorization is only possible if this change of order does not change the results of the calculation.

The simplest case is when data elements that are written (stored to) do not appear in any other iteration of the individual loop. In this case, all the iterations of the original loop are independent of each other and can be executed in any order without changing the result. The loop may be safely executed using any parallel method, including vectorization. The first example and loops 1 and 3 of example 2 fall into this category.

When a variable is written in one iteration and read in a subsequent iteration, there is a “read-after-write” dependency, also known as a flow dependency. A special case of this dependency is known as a reduction and occurs in the nested loops in example 2 above. All products formed in the j-loop are added to a single array element b. In the case of floating-point numbers, the result of this summation may depend strongly on the summation order. Unlike other compilers, the GNU compiler is very conservative with respect to this operation and requires additional permission to vectorize this kind of dependency.
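Why the summation order matters can be seen in a small sketch that compares a single running sum with the order a two-lane vectorized reduction would use. The function names and the two-lane layout are assumptions made for illustration; real compilers choose the lane count from the target's vector width:

```c
#include <stddef.h>

/* Scalar order: one running sum, elements added left to right. */
float sum_in_order(const float *x, size_t n)
{
    float s = 0.0f;
    for (size_t i = 0; i < n; i++)
        s += x[i];
    return s;
}

/* Order of a two-lane reduction: even and odd elements are
   accumulated separately and combined at the end. */
float sum_two_lanes(const float *x, size_t n)
{
    float lane0 = 0.0f, lane1 = 0.0f;
    size_t i = 0;
    for (; i + 1 < n; i += 2) {
        lane0 += x[i];
        lane1 += x[i + 1];
    }
    if (i < n)              /* leftover element when n is odd */
        lane0 += x[i];
    return lane0 + lane1;
}
```

For the input {1e8f, 1.0f, -1e8f, 1.0f}, and assuming IEEE 754 single-precision arithmetic without excess precision, the first function returns 1 and the second returns 2: in the running sum the first small term is lost when it is added to the large one, while in the two-lane order the large terms cancel before the small ones are added.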

A more typical “read-after-write” dependency is given by the loop
for (j=1; j<MAX; j++) A[j]=A[j-1]+1;
This cannot safely be vectorized: if the first two iterations are executed simultaneously by a vector instruction, the value of A[1] is used by the second iteration before it has been calculated by the first iteration.
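The effect can be reproduced in scalar code by imitating what a naive two-wide vector version would do, namely loading the inputs of two iterations before storing either result. This is a sketch for illustration only; the function names are hypothetical and no compiler would generate the second variant for this loop:

```c
#include <stddef.h>

/* Scalar execution of A[j] = A[j-1] + 1: each iteration sees the
   value written by the iteration before it. */
void chain_scalar(int *A, size_t n)
{
    for (size_t j = 1; j < n; j++)
        A[j] = A[j - 1] + 1;
}

/* Imitation of a naive two-wide vector version: both inputs are
   loaded before either result is stored. */
void chain_naive_pairs(int *A, size_t n)
{
    size_t j = 1;
    for (; j + 1 < n; j += 2) {
        int in0 = A[j - 1];   /* input of iteration j */
        int in1 = A[j];       /* input of iteration j + 1, still the old value */
        A[j]     = in0 + 1;
        A[j + 1] = in1 + 1;
    }
    for (; j < n; j++)        /* scalar remainder */
        A[j] = A[j - 1] + 1;
}
```

Starting from five zeros, chain_scalar yields 0 1 2 3 4, whereas chain_naive_pairs yields 0 1 1 2 1.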

When a variable is read in one iteration and written in a subsequent iteration, this is a "write-after-read" dependency, also known as an anti-dependency, as in the following example:
for (j=1; j<MAX; j++) A[j-1]=A[j]+1;
This dependency is not safe for general parallel execution, since the iteration with the write may execute before the iteration with the read. However, for vectorization, no iteration with a higher value of j can complete before an iteration with a lower value of j, and so vectorization is safe (that is, it gives the same result as non-vectorized code) in this case. The following example, however, may not be safe, since vectorization might cause some elements of A to be overwritten by the first vector instruction before being used for the second one:
for (j=1; j<MAX; j++) {
  A[j-1]=A[j]+1;
  B[j]=A[j]*2.;
}

To vectorize this loop, it has to be split into two loops, with the assignment to B performed first. Some compilers may do so automatically.
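A sketch of the manual split, assuming that A and B are arrays of double that do not overlap; the function names and the length parameter n are hypothetical:

```c
#include <stddef.h>

/* Original form: both statements share one loop. */
void update_fused(double *A, double *B, size_t n)
{
    for (size_t j = 1; j < n; j++) {
        A[j - 1] = A[j] + 1;
        B[j] = A[j] * 2.;
    }
}

/* Split form: B is computed first from the still unmodified A,
   then A is updated in a second loop. */
void update_split(double *A, double *B, size_t n)
{
    for (size_t j = 1; j < n; j++)
        B[j] = A[j] * 2.;
    for (size_t j = 1; j < n; j++)
        A[j - 1] = A[j] + 1;
}
```

Both functions produce the same contents of A and B. In the split form, the first loop has fully independent iterations and the second contains only the write-after-read dependency discussed above.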

The compiler cannot safely vectorize a loop if there is a potential dependency. Consider the following example:
for (j=1; j<MAX; j++) C[j]=A[j]+B[j];
In the above example, the compiler needs to determine whether, for some iteration j, the output C[j] might refer to the same memory location as one of the inputs A[j] or B[j]. Such memory locations are sometimes said to be aliased. For example, if A[j] pointed to the same memory location as C[j-1], there would be a "read-after-write" dependency as in the earlier example. If the compiler cannot exclude this possibility, it will not vectorize the loop unless you provide the compiler with hints.
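The aliasing case described here can be made concrete when the loop sits in a function that receives pointers. In this sketch the function name vec_add and its parameters are hypothetical:

```c
#include <stddef.h>

/* The loop from the text, wrapped in a function. Nothing in the
   signature tells the compiler that C, A and B refer to distinct
   memory. */
void vec_add(double *C, const double *A, const double *B, size_t n)
{
    for (size_t j = 1; j < n; j++)
        C[j] = A[j] + B[j];
}
```

A call such as vec_add(buf + 1, buf, B, n), with buf holding at least n + 1 elements, makes A[j] and C[j-1] the same location, so each iteration reads the value the previous iteration wrote. With buf = {0, 1, 0, 0, 0}, B = {0, 1, 1, 1} and n = 4, the call leaves buf as 0 1 2 3 4, a result that depends on the iterations running strictly in order.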

How you may provide additional information to the compiler is covered in the next section.