The goal of this tutorial is to provide guidelines for enabling the compiler's vectorization capability. This document is aimed at C/C++ programmers working on systems based on Intel® processors that support SIMD instructions. In computer science, the process of converting an algorithm from a scalar implementation, which performs an operation on one pair of operands at a time, to a vector implementation, where a single instruction operates on a series of adjacent values, is called vectorization. To estimate the potential performance gain from vectorization, you can run the following one-line shell command on the host in question:
for f in $(grep avx /proc/cpuinfo | uniq); do if [ "${f##avx}" != "$f" ]; then echo "$f"; fi; done

If you are not familiar with the line above: it scans the CPU flags in /proc/cpuinfo and prints every flag that starts with avx.

The output may contain the following lines:

name        instruction set      #floats  #doubles
(empty)     no avx               -        -
avx         set 1 - limited      8        4
avx2        set 2 - advanced     8        4
avx512f     foundation           16       8
avx512cd    conflict detection   16       8
avx512dq    double word          16       8
avx512bw    byte and word        16       8
avx512vl    vector length        16       8

If you are going to use AVX, the entry avx2 means that a single instruction can work on 4 doubles at a time, so your code may run up to 4 times faster with vectorization. How this works is explained below. The output lists only the instruction sets that are available: avx2 is a superset of avx, and avx512 is a superset of avx2. If several instruction sets are listed, only the highest one determines the number of operands processed per vector instruction.

Some CPUs contain more than one vector unit per core. Xeon Platinum, Gold 61XX, and Gold 5122 have two AVX-512 FMA units per core. Xeon Gold 51XX (except 5122), Silver, and Bronze have a single AVX-512 FMA unit per core. If there are two units, the number of operands that can be processed per cycle is of course twice the number given in the table above.

The model name of your CPU is available via

cat /proc/cpuinfo | grep "model name" | uniq

In many cases you do not even have to change the code to gain this performance boost: auto-vectorization by the compiler will do most of the work.

The sets avx512dq, avx512bw, and avx512vl are only of minor interest; they cover peripheral requirements such as float-to-64-bit-integer conversion, support for 8-bit and 16-bit elements, and the ability to vectorize smaller loops.