Using Intel Vectorization Advisor

Intel Advisor is a vectorization optimization and shared-memory threading assistance tool for C, C++, C# and Fortran software developers and architects. The product is licensed for members of TU Kaiserslautern on Linux.

Vectorization Advisor supports the analysis of scalar, SSE, AVX, AVX2 and AVX-512 code generated by the auto-vectorizers of the Intel and GNU compilers. It also supports the analysis of explicitly vectorized code that uses OpenMP. This page focuses on the Intel compiler icc. Details on using gcc are given elsewhere.

Get Started

To get started on Elwetritsch, simply type
module add intel/latest
Compile your C/C++ code with the following (additional) options:
Intel compiler    GNU compiler
-O                -O
You'll find more elaborate options below.
First Intel Advisor is applied to the following simple source:
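The original listing is not reproduced on this page. As a stand-in, the kernel below is a hypothetical sketch that merely matches what the later sections describe: an outer loop around an inner loop doing 32-bit integer arithmetic, reading arrays a and b and writing array c. The function name, the signature and the repetition parameter are assumptions, not the original simple01.c.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in, NOT the original simple01.c.
   Assumed shape: an outer repetition loop around an inner loop that
   adds two 32-bit integer arrays element by element. */
void add_arrays_repeated(const int32_t *a, const int32_t *b, int32_t *c,
                         size_t n, int repeats)
{
    int r;
    size_t i;

    for (r = 0; r < repeats; ++r) {   /* outer loop: repetitions */
        for (i = 0; i < n; ++i) {     /* inner loop: candidate for vectorization */
            c[i] = a[i] + b[i];
        }
    }
}
```

A main function that allocates and fills the arrays and then calls add_arrays_repeated would complete the program.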

To start the GUI of the advisor type:

If necessary, open a new project and select the compiled executable as the application. You can then run the roofline report from the depicted menu.
Menu missing

In the result, click the Summary tab. The program metrics are listed at the top. The most important line for the moment is Vector Instruction Set, which reveals that the program vectorizes, but only with the SSE2 instruction set. SSE2 uses the (old) 128-bit vector registers and has little in common with the 512-bit registers of AVX-512 that we want to use.

Metrics picture missing

Fortunately, there are recommendations on this page too, and we may read:

Recommendations missing

Note that here both the analysis and the compilation are carried out on a machine capable of AVX-512 instructions. The steps may be separated: the compilation may instead be performed on a different host, even one without AVX-512 hardware, as long as the compiler supports AVX-512. Use -xCORE-AVX512 if the host that runs the program supports AVX-512 (see Introduction). Use -axCORE-AVX512 to generate multiple, processor-specific auto-dispatch code paths for Intel processors where there is a performance benefit; this option also generates a baseline code path that can run on non-AVX processors. -xHost is only meaningful if AVX-512 is available on the compile host. The analysis may be performed via the command line (see below) on a remote host, and the GUI may be run on a simple desktop system.
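To check which instruction set a given set of options actually targets, the predefined macros of the compiler can be inspected. The sketch below assumes the GCC-style macros __SSE2__, __AVX__, __AVX2__ and __AVX512F__, which gcc defines and which icc is expected to define as well; the function name target_isa is made up for illustration.

```c
/* Report the widest vector instruction set this translation unit was
   compiled for, judged by the compiler's predefined macros.
   Assumption: the compiler follows the GCC macro naming. */
const char *target_isa(void)
{
#if defined(__AVX512F__)
    return "AVX-512";
#elif defined(__AVX2__)
    return "AVX2";
#elif defined(__AVX__)
    return "AVX";
#elif defined(__SSE2__)
    return "SSE2";
#else
    return "scalar or unknown";
#endif
}
```

The macros only reflect the options this file was compiled with. They say nothing about what the executing processor supports or which auto-dispatch path is chosen at run time.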
We recompile as recommended with the command line
icc -g -qopt-report=5 -O -xCORE-AVX512 -o simple01 simple01.c
and run roofline again within our advisor project and obtain a new performance metric and recommendation:

Recommendation missing

First, and most important, the elapsed time has been reduced. Second, we observe that the AVX2 and AVX instruction sets are used, together with a relevant recommendation referring to the -qopt-zmm-usage option. Indeed, our code now uses the AVX instruction set, but only the 256-bit version, as can be seen from the assembly listing provided on the Survey tab. The important piece is listed here:

Assembly listing missing

More about registers and their notation is given here. YMM registers are the 256-bit vector registers of the AVX and AVX2 instruction sets (XMM registers are those of the 128-bit SSE2 set).
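The register names map directly onto how many elements a single instruction can process. As a small illustration (the helper name is hypothetical), the number of lanes is the register width divided by the element width:

```c
/* Number of elements of a given width that fit into one vector register.
   Returns 0 if element_bits is 0 to avoid a division by zero. */
unsigned lanes_per_register(unsigned register_bits, unsigned element_bits)
{
    if (element_bits == 0) {
        return 0;
    }
    return register_bits / element_bits;
}
```

For 32-bit integers this gives 4 lanes in an XMM register (128 bit), 8 in a YMM register (256 bit) and 16 in a ZMM register (512 bit).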

Back to the recommendation. We recompile our source with the options

icc -g -qopt-report=5 -O -xCORE-AVX512 -qopt-zmm-usage=high -o simple01 simple01.c
and are back with the roofline output

Recommendation missing

and indeed, this time the AVX-512 vector instruction set is used, the elapsed time has been reduced further, and the vectorization gain has increased. Checking the assembly listing indeed reveals that ZMM registers are now used, as expected for the AVX-512 instruction set.

Listing missing

Switching to the Source tab, we see that the outer loop was left untouched, whereas the inner loops, which perform 32-bit integer operations, were vectorized.

Listing missing

Next, we may be interested in the trip counts and flops, so we run that analysis. Selecting the command window (rightmost button) shows the command line, which makes it possible to use Advisor on a remote system without direct interactive access and without the GUI.

Command missing

In our environment this would simply read:

cd $HOME/Projekte/AVX
module add intel/latest
advixe-cl -collect tripcounts -flop -project-dir Advisor/vec002 -- simple01
This analysis provides a bit more information. The Code Analytics tab gives hints for possible further optimizations. Our little piece of code spends just 17% of its time on computation and needs 50% for memory transfers. In doing so it reaches a memory bandwidth of 76.8 GB/s.
Analysis missing Analysis missing
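The roofline report relates such numbers through a simple bound: the attainable throughput is the smaller of the peak compute rate and the arithmetic intensity multiplied by the memory bandwidth. The sketch below evaluates this bound. The function name is hypothetical, and all figures in the usage note other than the 76.8 GB/s are assumptions; in particular, one operation per 12 bytes presumes a kernel that loads two 4-byte integers and stores one for each addition.

```c
/* Roofline bound: min(peak compute rate, arithmetic intensity * bandwidth).
   peak_gops and the result are in 10^9 operations per second,
   bandwidth_gbs is in GB/s. bytes_per_element must be greater than zero. */
double roofline_bound(double peak_gops, double bandwidth_gbs,
                      double ops_per_element, double bytes_per_element)
{
    double intensity = ops_per_element / bytes_per_element; /* operations per byte */
    double memory_bound = intensity * bandwidth_gbs;

    return memory_bound < peak_gops ? memory_bound : peak_gops;
}
```

Under these assumptions, roofline_bound(100.0, 76.8, 1.0, 12.0) gives about 6.4 billion operations per second. With so little work per byte, the loop is limited by memory traffic, whatever the peak rate (here an arbitrary 100) may be.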
For this tiny example, diving into the memory analysis of one of the loops yields no surprising insights. The Memory Access Patterns report that is generated is given below. It informs us that the arrays a and b are read with unit stride and that array c is written with unit stride.

Listing missing
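To make the term concrete, the following sketch contrasts a unit-stride loop with a constant-stride loop. Both functions and their names are illustrative only and are not part of the analysed program.

```c
#include <stddef.h>

/* Unit stride: iteration i touches element i, so consecutive iterations
   touch adjacent memory locations. */
long sum_unit_stride(const int *x, size_t n)
{
    long s = 0;
    size_t i;

    for (i = 0; i < n; ++i) {
        s += x[i];
    }
    return s;
}

/* Constant stride: only every stride-th element is touched, leaving gaps
   between accesses. stride must be at least 1. */
long sum_constant_stride(const int *x, size_t n, size_t stride)
{
    long s = 0;
    size_t i;

    for (i = 0; i < n; i += stride) {
        s += x[i];
    }
    return s;
}
```

With a stride of 1 the second function touches the same elements as the first. Larger strides use only part of every cache line that is loaded, which is the kind of access a memory access pattern analysis is meant to flag.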

When you switch to production code, please remove the -g flag from the compiler options given above.