High Performance Computing with "Elwetritsch" at the University of Kaiserslautern-Landau

This tutorial provides guidelines for enabling the compiler's vectorization capability. It is aimed at C/C++ programmers working on systems with Intel® processors that support SIMD instructions. Vectorization is the process of converting an algorithm from a scalar implementation, which performs an operation on one pair of operands at a time, to a vector implementation, in which a single instruction operates on a series of adjacent values.

To estimate the performance gain you can expect from vectorization, run the following one-liner on the host in question:

for f in $(grep -m1 avx /proc/cpuinfo); do if [ "${f##avx}" != "$f" ]; then echo "$f"; fi; done

If you are not familiar with the line above, here is some information about it.
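As a reading aid, the same loop can be unpacked into a commented function. This is only a sketch: the function name `list_avx_flags` and its optional file argument are illustrative assumptions, not part of the original tutorial. The argument exists solely so the function can be tried on a saved copy of `/proc/cpuinfo`.

```shell
# list_avx_flags [FILE]: print every CPU flag that starts with "avx",
# one per line. FILE defaults to /proc/cpuinfo (hypothetical helper).
list_avx_flags() {
    cpuinfo="${1:-/proc/cpuinfo}"
    # All cores report the same flags, so the first matching line suffices.
    # $(...) is left unquoted on purpose: the shell splits the line into words.
    for f in $(grep -m1 avx "$cpuinfo"); do
        # ${f##avx} strips a leading "avx" from $f. If the result differs
        # from $f, the word started with "avx" and is a flag of interest.
        if [ "${f##avx}" != "$f" ]; then
            echo "$f"
        fi
    done
}
```

Calling `list_avx_flags` without an argument inspects the current host. On a host without AVX it prints nothing.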

The output may contain the following lines:

name	instruction set	#floats	#doubles
"empty"	no avx
avx	set 1 - limited	8	4
avx2	set 2 - advanced	8	4
avx512f	foundation	16	8
avx512cd	conflict detection	16	8
avx512dq	doubleword and quadword	16	8
avx512bw	byte and word	16	8
avx512vl	vector length	16	8

If avx2 is listed, for example, each vector instruction can work on 4 doubles at a time, so vectorized code may run up to 4 times faster. How to achieve this is explained below. The output lists only the instruction sets that are available. avx2 is a superset of avx, and avx512 is a superset of avx2. If several instruction sets are listed, only the highest one determines the number of operands processed per vector instruction (more details may be found here).
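The rule that only the highest listed set counts can be sketched as a small filter. The function name `widest_avx` and its output format are assumptions made for this illustration. The operand counts are the ones from the table above.

```shell
# widest_avx: read flag names (one per line) on stdin and print
# "<set> <floats> <doubles>" for the highest AVX generation found,
# or "none" if no AVX flag is present (hypothetical helper).
widest_avx() {
    has_avx=0 has_avx2=0 has_avx512=0
    while read -r flag; do
        case "$flag" in
            avx512*) has_avx512=1 ;;  # avx512f, avx512cd, avx512dq, ...
            avx2)    has_avx2=1 ;;
            avx)     has_avx=1 ;;
        esac
    done
    # Report only the highest generation, since it is a superset of the others.
    if   [ "$has_avx512" -eq 1 ]; then echo "avx512 16 8"
    elif [ "$has_avx2"   -eq 1 ]; then echo "avx2 8 4"
    elif [ "$has_avx"    -eq 1 ]; then echo "avx 8 4"
    else echo "none"
    fi
}
```

One way to feed it with the flags of the current host is `grep -m1 avx /proc/cpuinfo | tr ' ' '\n' | widest_avx`.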

Some CPUs contain more than one vector unit per core. Xeon Platinum, Gold 61XX, and Gold 5122 have two AVX-512 FMA units per core. Xeon Gold 51XX (except 5122), Silver, and Bronze have a single AVX-512 FMA unit per core. If there are two units, the number of operands that can be processed at once is twice the number given in the table above.

The model type of your CPU is available through

cat /proc/cpuinfo |grep "model name" | uniq
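Combining the model name with the rule about AVX-512 FMA units stated earlier might look like the sketch below. The function name `avx512_fma_units` is an assumption, and so is the exact spelling of the model strings, which are expected to contain text such as "Gold 6126". Models the rule does not mention are reported as unknown rather than guessed.

```shell
# avx512_fma_units "MODEL NAME": print 2 or 1 according to the rule in the
# text, or "unknown" for any model it does not cover (hypothetical helper).
avx512_fma_units() {
    case "$1" in
        # Platinum, Gold 61XX and the Gold 5122 exception: two units.
        *Platinum*|*"Gold 61"[0-9][0-9]*|*"Gold 5122"*) echo 2 ;;
        # Remaining Gold 51XX, Silver and Bronze: one unit.
        *"Gold 51"[0-9][0-9]*|*Silver*|*Bronze*)          echo 1 ;;
        *)                                                 echo unknown ;;
    esac
}
```

For example, `avx512_fma_units "Intel(R) Xeon(R) Gold 5118 CPU @ 2.30GHz"` prints 1. The Gold 5122 exception is matched in the first branch, before the general Gold 51XX pattern.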

In some cases you do not even have to change your code to gain this performance boost: auto-vectorization by the compiler will do most of the job.

The sets avx512dq, avx512bw and avx512vl are only of minor interest (details). They cover peripheral requirements such as float to 64-bit integer conversion, support for 8-bit and 16-bit elements, and the ability to vectorize smaller loops.
