Matrix vector multiplication

We start with program code written in C in two files, multiply.c and driver.c. It is a benchmark that repeatedly performs the same operation by calling the function matvec in multiply.c. To start with, the matrix dimension is chosen to be 64x63. One call requires 2x64x63 floating point operations, so by measuring the time spent we can directly calculate the performance of the code. All experiments are performed on an exclusively allocated node with CPUs of type Intel Xeon Gold 6154 and hyperthreading disabled. This CPU provides 2 AVX-512 units per core. It has a 25 MB L3 cache shared among all cores of the CPU; each core has a 1 MB L2 cache and 32 KB of L1 cache each for data and instructions. The size 64x63 is chosen so that the matrix together with both vectors fits almost entirely into the L1 data cache.
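The benchmark derives its performance number from exactly this flop count. The following is a sketch of the calculation only; the function name and parameters are illustrative and not taken from the original driver.c:

```c
/* Sketch (illustrative names, not the original driver.c): deriving the
   GFlops rate from the wall time of 'reps' calls to matvec.  Each call
   performs one multiply and one add per matrix element, i.e.
   2*rows*cols floating point operations. */
double gflops_rate(unsigned int rows, unsigned int cols,
                   unsigned long reps, double seconds)
{
    double flops = 2.0 * rows * cols * (double)reps;
    return flops / seconds / 1e9;
}
```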

This text first describes the code modifications and compiler switches for the Intel compiler in detail, and then the same code modification steps with the GNU compiler. Results are collected in Table 1 and Table 2, respectively.

Code Analysis for the Intel Compiler icc

Experiments are using intel/2018.

Basics, Looking at the Code

We start compilation with the optimization option -O2, which should enable auto-vectorization.

Take a look at the following code snippet which actually defines a matrix vector multiplication.

It is a bit unusual in that the loop increments inc_i and inc_j are declared extern and defined in a different file, and a macro is used to specify the data type of a, b and x. The matrix a has been allocated as a 1D array of size cols*rows.

Apart from that, the code is a straightforward matrix-vector multiplication, which is known to perform well, and both loops should be amenable to vectorization.

Unfortunately, compilers do not interpret function names and do not recognize a matrix-vector multiplication by themselves. Thus, compiling and executing the code for the 64x63 matrix results in a very disappointing 2.4 GFlops, which on a core running at 2.7 GHz is not even 4% of the peak performance.
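The peak value that percentage refers to can be estimated from the hardware data given above; a sketch, assuming the core sustains 2.7 GHz while executing AVX-512 code (real AVX-512 clocks may be lower):

```c
/* Back-of-the-envelope peak performance of one core:
   2 FMA units x 8 doubles per zmm register x 2 flops per FMA,
   times the clock rate.  At 2.7 GHz this gives 86.4 GFlops. */
double peak_gflops(double ghz)
{
    const int fma_units = 2, doubles_per_zmm = 8, flops_per_fma = 2;
    return ghz * fma_units * doubles_per_zmm * flops_per_fma;
}
```

2.4 GFlops out of roughly 86 GFlops is below 3% of peak.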

One reason is the extern declaration of inc_i and inc_j. At compile time the compiler does not know what to do with the loops: inc_i and inc_j could hold any value, such as -1, 0 or 1, and each value would give the function a totally different meaning. After setting both values to 1, the Intel compiler indeed vectorizes the code and the performance improves to 4.3 GFlops - still a very disappointing value. The GNU compiler, on the other hand, is still unwilling to vectorize anything.

Vectorization using AVX512

Now we look a bit more into vectorization and compile the source file multiply.c, asking for the assembly code:
icc -S -O2 multiply.c
and look for the usage of vector registers by grepping for the string "mm":
grep mm multiply.s

Indeed a lot of lines are displayed, but all of them reference xmm, the 128-bit SSE registers. After adding the option -xAVX we see ymm registers (256 bit) being used, and after exchanging this for -xCOMMON-AVX512 we find lines with zmm registers (512 bit) in the assembler output. The performance, however, increases only slightly, to 11.2 GFlops.

The arguments to the matrix-vector multiplication are passed, as usual, as pointers. Though we know that two vectors and one matrix are passed as arguments, the compiler does not. It may well be that we pass the vector b twice, using the same pointer for x and b, or that the vectors and the matrix overlap. There are many possibilities that make no sense to us, but the compiler has to produce correct output whatever absurd pointers we pass to the function. Since none of this applies here, we can tell the compiler so. The simplest way is to assert that there are no data dependencies in the loop to be optimized, with the help of the compiler pragma

#pragma ivdep
where ivdep stands for "ignore vector dependencies". With the help of this directive we reach 21.2 GFlops with the Intel compiler, far better than the 2.4 GFlops we started with.
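The pragma is placed directly in front of the loop it refers to. A sketch in icc syntax (the gcc variant of the pragma appears further below):

```c
/* Sketch: asserting the absence of loop-carried dependencies for icc.
   #pragma ivdep applies to the loop that immediately follows it. */
#define FTYPE double

void matvec_ivdep(unsigned int rows, unsigned int cols,
                  FTYPE *a, FTYPE *b, FTYPE *x)
{
    for (unsigned int i = 0; i < rows; i++) {
        b[i] = 0.0;
#pragma ivdep
        for (unsigned int j = 0; j < cols; j++)
            b[i] += a[i * cols + j] * x[j];
    }
}
```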

Most of the possible gain has been achieved at this point. If you stop here, you have got about 90% of it. If you want more, continue reading.

Padding

Still we can achieve a bit more. In our example we use a 64x63 matrix, that is, rows=64 and cols=63. If element a(0,0)=a[0] starts at address 0, then the first element of the next row, a(1,0)=a[63], has address 63x8 bytes, which is a misaligned address for vectorization; 64x8 would be a better address (the reason follows in the next paragraph). Thus we can add some 64 elements to our matrix, artificially enlarging it to 64x64, while still working on the 64x63 matrix: every row then starts at a favourable address, for the price of an address jump in each row. This technique is called padding. Doing so, we gain some performance and are now at 22.1 GFlops, and we have prepared the next optimization step, which is only meaningful in combination with this padding.
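A sketch of what padding means in code; COLBUF, lda and the function names are illustrative, not taken from the original sources:

```c
#include <stdlib.h>

/* Padding sketch: allocate each row with cols+COLBUF elements so that
   consecutive rows start 64*8 = 512 bytes apart.  Combined with a
   64-byte aligned base address (next section), every row then starts
   on a 64-byte boundary.  The pad column is never touched. */
#define FTYPE  double
#define COLBUF 1                           /* pads 63 columns up to 64 */

FTYPE *alloc_padded(unsigned int rows, unsigned int cols)
{
    unsigned int lda = cols + COLBUF;      /* padded row length */
    return malloc((size_t)rows * lda * sizeof(FTYPE));
}

void matvec_padded(unsigned int rows, unsigned int cols,
                   FTYPE *a, FTYPE *b, FTYPE *x)
{
    unsigned int lda = cols + COLBUF;
    for (unsigned int i = 0; i < rows; i++) {
        b[i] = 0.0;
        for (unsigned int j = 0; j < cols; j++)  /* still 63 real columns */
            b[i] += a[i * lda + j] * x[j];       /* jump over the pad */
    }
}
```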

Alignment

The vectorization report of the Intel compiler, written to the file multiply.optrpt, reveals what happened to the loops.

The outer loop was left unchanged, but the inner loop was duplicated into three versions, one of which was vectorized. An AVX-512 register holds 8 doubles. The first element of each of our arrays (see Data Alignment) always starts at an address divisible by 8, but that is not the natural alignment for loading a 512-bit AVX register, which requires an address divisible by 64. Loading unaligned data requires additional instructions and possibly even a partial load. This is handled by the first additional loop, the so-called peeled loop: it is executed only if necessary and runs at reduced performance.

The loop in the middle is the one that actually does the work. The last loop copes with the case that the problem size is not divisible by 8; more precisely, it handles those elements at the end which do not fill a complete AVX-512 register (the remainder loop). This loop, too, runs at reduced performance, but there is nothing to be done about it.

We already know (see Data Alignment) how to allocate arrays at 64-byte boundaries, but the compiler does not know about the alignment of our arrays, and we win additional performance if we tell it. We can use a builtin function (a GNU C extension available in both icc and gcc) and, in the case of icc, add the pragma vector aligned. This pragma only works correctly after padding the matrix to 64 columns (COLBUF=1).
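A sketch of both measures combined. The function name and the lda parameter are illustrative; __builtin_assume_aligned is the GNU extension mentioned above, and the pragma is guarded so that only icc sees it:

```c
#define FTYPE double

/* Sketch: telling the compiler the arrays are 64-byte aligned.
   __builtin_assume_aligned is accepted by both gcc and icc;
   #pragma vector aligned is icc-only and is only valid because the
   padded row length (64 doubles = 512 bytes) keeps every row
   64-byte aligned. */
void matvec_aligned(unsigned int rows, unsigned int cols, unsigned int lda,
                    FTYPE *a, FTYPE *b, FTYPE *x)
{
    FTYPE *ap = __builtin_assume_aligned(a, 64);
    FTYPE *xp = __builtin_assume_aligned(x, 64);
    for (unsigned int i = 0; i < rows; i++) {
        b[i] = 0.0;
#ifdef __INTEL_COMPILER
#pragma vector aligned
#endif
        for (unsigned int j = 0; j < cols; j++)
            b[i] += ap[i * lda + j] * xp[j];
    }
}
```

The arrays themselves must of course be allocated 64-byte aligned, e.g. with C11 aligned_alloc(64, ...); the hint is a promise to the compiler, not a correction.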

Now we are almost done. We can allow the compiler to exchange information between all files to be compiled with the option -ipo (interprocedural optimization) and gain another performance increase, now reaching 27.7 GFlops.

The Intel compiler as well as the GNU compiler provide many different options for the various AVX-512 subsets available (see here). Most important for us are two of them: -xCOMMON-AVX512 and -xCORE-AVX512. The first (see here) generates code for all types of AVX-512 hardware; at the time of writing it covers the Skylake as well as the Knights Landing instruction sets and is thus generally usable. At the time of writing, -xCOMMON-AVX512 achieved in most cases better results on Skylake than -xCORE-AVX512 compiled on a Skylake CPU, except for the final compilation of this example, where -xCORE-AVX512 adds another 10% performance gain (see Table 1).

We conclude with the final set of compiler options compiling the whole program:

icc -O2 -ipo -xCORE-AVX512 -qopt-zmm-usage=high -c multiply.c
icc -O2 -ipo -c driver.c
icc -O2 -ipo -o final_program driver.o multiply.o
changes                  performance [GFLOPS]   comments
Start with -O2           2.4
known increments (=1)    4.3                    vectorization, but only xmm
AVX-512                  11.2                   using zmm registers
ivdep                    21.2
padding                  22.1                   COLBUF=1
align                    22.9                   pragma vector aligned
align                    27.7                   added -ipo to all compilation steps
CORE-AVX512              29.3

Table 1: Performance optimization steps with Intel's compiler icc

Code Analysis for the GNU compiler gcc

Experiments are using gcc 8. Basic options for compiling and linking are:

gcc -O2 -lm
where -lm adds the math library libm.a in the linking step.

The vectorization report can be enabled with switches such as -ftree-vectorize -fopt-info-vec-missed -fopt-info-loop-optimized. As can be seen from Table 2, simple compilation gives a poor performance of only 2.2 GFlops.

Closer analysis reveals that the innermost loop performs a summation into a single value b[i], with i constant for all j. This type of operation is called a reduction, and it is sensitive to rounding: vectorizing it changes the order of the additions and may therefore change the result slightly. To prevent any such errors, which are possible but rare, gcc never vectorizes this type of operation unless explicitly allowed to. The option -ffast-math permits the vectorization of such operations. In our case, without specifying that the pointers do not overlap, only the 128-bit xmm registers are used for vectorization.

changes                                performance [GFLOPS]   comments
Start with -O2                         2.2
known increments (=1), options
-mavx512f -march=skylake-avx512
-ffast-math                            2.7                    vectorization, but only xmm
AVX-512, ivdep and restrict            8.6                    using zmm registers
adding -flto                           20.5
padding                                28.0                   COLBUF=1
align                                  29.2

Table 2: Performance optimization steps with the GNU compiler gcc

It should be mentioned that for gcc the ivdep pragma has to be changed slightly and that the restrict keyword should be added to the pointer arguments, that is, the corresponding code lines (only those are given here) will look like:

void matvec(unsigned int rows, unsigned int cols,
            FTYPE *restrict a, FTYPE *restrict b, FTYPE *restrict x)
.
#pragma GCC ivdep
        for (j = 0; j < cols; j += inc_j) {
Still, the performance of 8.6 GFlops is rather disappointing. But gcc can do much better if we switch on link-time optimization, which seems to be crucial for effective optimization and for fully ignoring dependencies. Adding -flto to the compile options boosts the performance to 20.5 GFlops for this step. It is important to know that -flto has to be added to all stages of the compilation, i.e. in a Makefile as a compile option for all files and as a final link option; otherwise it has no effect.

If we then add padding and alignment information, as described above, the performance increases to 29.2 GFlops. The final compilation lines look like:

gcc -c -O2 -ftree-vectorize -mavx512f -mavx512cd -ffast-math -march=skylake-avx512 -flto multiply.c
gcc -c -O2 -flto driver.c
gcc -o final_gcc -O2 -flto driver.o multiply.o -lm

Enlarging the problem size

Even though we have obtained a performance boost from approximately 2 GFlops to almost 30 GFlops, we have still reached less than 50% of the peak performance of a core. What is wrong?

The 64x63 matrix fits into the L1 cache, but not into the registers. Running through our loops requires loading the matrix from L1 into registers, and these loads take almost as long as the arithmetic operations themselves, doubling the time needed. Thus, even if we reached peak performance in the innermost loop, we could not reach more than 50% of peak performance in total.

The situation gets worse, if we enlarge our matrix so that it does not fit anymore into the L1 cache.