Data Alignment

Data structure alignment is the adjustment of any data object in relation to other objects. Alignment is a property of a memory address, expressed as the numeric address modulo a power of two. In addition to its address, a single datum also has a size. A datum is called 'naturally aligned' if its address is aligned to its size; otherwise it is called 'misaligned'. For example, an 8-byte floating-point datum is naturally aligned if the address used to identify it is aligned to eight (8), that is, if the address is a multiple of 8.
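The definition above (address modulo a power of two) can be checked directly in code. The following is a minimal sketch; the helper name is_aligned is an assumption introduced here for illustration and does not come from any library.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: returns true if ptr is aligned to `alignment` bytes.
   `alignment` must be a non-zero power of two. */
static bool is_aligned(const void *ptr, size_t alignment)
{
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    /* For a power of two, address % alignment equals address & (alignment - 1). */
    return ((uintptr_t)ptr & (alignment - 1)) == 0;
}
```

With this helper, a double d is naturally aligned if is_aligned(&d, sizeof d) holds.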

Arrays

Usually arrays are dynamically allocated like in the following example:
#include <stdlib.h>
float *array;
array=(float*)malloc(SIZE*sizeof(float));
The resulting array will be aligned to at least eight bytes, which is easily verified by:
printf("array starts at address %p\n",(void*)&array[0]);
To allocate memory with a different alignment, POSIX provides posix_memalign (the older memalign is obsolete, and C11 adds aligned_alloc), which allows, for example, an alignment to sixty-four (64):
#include <stdlib.h>
float *array=NULL;
void *p=NULL;
if (posix_memalign(&p,64,SIZE*sizeof(float))==0) array=(float*)p;
The keyword __alignof__ (a GNU extension; C11 offers _Alignof) allows you to inquire about how an object is aligned, or the minimum alignment usually required by a type. For both examples above it returns the minimum alignment of the element type, that is, four (4) for a float or eight (8) for a double, regardless of how the array was allocated.
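The aligned allocation described above can be wrapped and verified as in the following sketch. The function names alloc_aligned_floats and has_alignment are assumptions made for this example; only posix_memalign and free are actual library calls. The requested alignment must be a power of two and a multiple of sizeof(void *).

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200112L  /* expose posix_memalign under strict -std=c11 */
#endif
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper: allocate n floats aligned to `alignment` bytes.
   Returns NULL on failure; release the memory with free(). */
static float *alloc_aligned_floats(size_t n, size_t alignment)
{
    void *p = NULL;
    if (posix_memalign(&p, alignment, n * sizeof(float)) != 0)
        return NULL;
    return (float *)p;
}

/* Hypothetical helper: non-zero if the address of p is a multiple of `alignment`. */
static int has_alignment(const void *p, size_t alignment)
{
    return (uintptr_t)p % alignment == 0;
}
```

A call such as alloc_aligned_floats(SIZE, 64) then yields a pointer for which has_alignment(array, 64) holds, while _Alignof(float) still reports the type's minimum alignment.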

If the array allocated above is to be vectorized starting at element 0, performance may benefit from an alignment to 64.

How to align a single variable R depends on the compiler:

GNU (gcc):
double R __attribute__((aligned(64)));
Intel (icc):
double __declspec(align(64)) R;
double R __attribute__((aligned(64)));

Structures

The most common and well-known data structure is the array, which contains a contiguous collection of data items that can be accessed by an ordinal index. If several data items describe an entity and should be grouped together, the data can be organized as an array of structures (AoS) or as a structure of arrays (SoA). While the AoS organization works excellently for encapsulation, it works poorly for vector processing. To illustrate this point, consider the following structure, which encapsulates characteristic features (type, coordinates, mass, velocity) of a particle.
code                          align/offset   size
struct Particle {             -              -
 char type;                   0              1
 double x,y;                  8,16           8,8
 float m;                     24             4
 double vx,vy;                32,40          8,8
};                            -              37
struct Particle *particle;    -              48
All data is naturally aligned within the structure. If type is located at address 0, the following double x is located at an offset of 8, that is, at address 8, and so on. As a consequence, the size of this structure is 48 bytes even though it contains only 37 bytes of meaningful data. If float m instead follows directly after type, the offset of m is four and the size of the structure is reduced to 40.
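The offsets and sizes discussed above can be inspected with offsetof and sizeof. The sketch below assumes a typical 64-bit ABI in which double is 8-byte aligned inside structures; the name ParticleReordered and the function print_layout are introduced here only for illustration.

```c
#include <stddef.h>
#include <stdio.h>

/* Layout from the text: padding follows both `type` and `m`. */
struct Particle {
    char type;
    double x, y;
    float m;
    double vx, vy;
};

/* Hypothetical reordered variant: `m` fills the padding after `type`. */
struct ParticleReordered {
    char type;
    float m;
    double x, y;
    double vx, vy;
};

/* Print the member offsets and total sizes of both layouts. */
static void print_layout(void)
{
    printf("Particle:          x at %zu, m at %zu, vx at %zu, size %zu\n",
           offsetof(struct Particle, x), offsetof(struct Particle, m),
           offsetof(struct Particle, vx), sizeof(struct Particle));
    printf("ParticleReordered: m at %zu, x at %zu, vx at %zu, size %zu\n",
           offsetof(struct ParticleReordered, m),
           offsetof(struct ParticleReordered, x),
           offsetof(struct ParticleReordered, vx),
           sizeof(struct ParticleReordered));
}
```

On platforms with a different alignment rule for double (for example some 32-bit ABIs) the reported numbers may differ from those in the table.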

This is not just a reduction by 8 bytes (so what?): for several million particles it adds up to a significant amount of memory and, even more important, at a given memory bandwidth the transfer time is directly proportional to the amount of data to be transported. That is, 48 bytes take 6/5 times as long to move as 40 bytes, and thus we may save about 17% of the time (8 of every 48 bytes) by a simple rearrangement within a structure.

But that is a different story; back to vectorization. We now consider the following simple loops:

array of particles (AoS):
for (int p_=0;p_<n;p_++) {
 particle[p_].y=c[p_];
}
structure of arrays (SoA):
for (int p_=0;p_<n;p_++) {
 particle->y[p_]=c[p_];
}
Both loops may be vectorized, but with completely different behavior. The SoA example is based on the structure
struct Particle {
char *type;
double *x,*y;
float *m;
double *vx,*vy;
};
where each particle property is stored in an array (to be allocated, see above). Inspection of the generated assembler code reveals that array c is loaded into the AVX-512 register zmm0 and from there stored into array y of structure Particle.
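A minimal sketch of how the SoA structure above might be allocated and filled is given below. The helpers particles_alloc, particles_free and set_y are hypothetical names chosen for this example; plain calloc is used for brevity, and an aligned allocation could be substituted. The sketch assumes n > 0.

```c
#include <stdlib.h>

/* SoA layout from the text: one array per particle property. */
struct Particle {
    char *type;
    double *x, *y;
    float *m;
    double *vx, *vy;
};

/* Hypothetical helper: release all property arrays (free(NULL) is a no-op). */
static void particles_free(struct Particle *p)
{
    free(p->type); free(p->x); free(p->y);
    free(p->m); free(p->vx); free(p->vy);
}

/* Hypothetical helper: allocate zero-initialized arrays for n particles.
   Returns 0 on success and -1 on failure. */
static int particles_alloc(struct Particle *p, size_t n)
{
    p->type = calloc(n, sizeof *p->type);
    p->x = calloc(n, sizeof *p->x);
    p->y = calloc(n, sizeof *p->y);
    p->m = calloc(n, sizeof *p->m);
    p->vx = calloc(n, sizeof *p->vx);
    p->vy = calloc(n, sizeof *p->vy);
    if (!p->type || !p->x || !p->y || !p->m || !p->vx || !p->vy) {
        particles_free(p);
        return -1;
    }
    return 0;
}

/* The SoA loop from the text: both y and c are traversed with unit stride. */
static void set_y(struct Particle *particle, const double *c, size_t n)
{
    for (size_t i = 0; i < n; i++)
        particle->y[i] = c[i];
}
```

Because y is a contiguous array of doubles here, consecutive iterations touch consecutive memory, which is the access pattern a compiler can map to plain vector loads and stores.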

The assembler code for the first loop, the array of particles, which is more generally called an array of structures (AoS), is much more involved. The structure has a size of 48 bytes, and therefore the y values of subsequent particles are 48 bytes apart, too. A vector instruction works on a vector register into which all elements to be processed are loaded (see vectorization). It is of course relatively simple to put the elements of an array into a vector register: the system starts at an address and takes 512 bits (64 bytes), that's it. Next time it just continues by loading another 512 bits.

This contiguous way of loading memory into registers is available in all flavors of the AVX instruction sets and is thus fully supported on all Intel hardware that provides any type of AVX.

Loading strided memory addresses, in our case every 48 bytes, requires special instructions, which, for example, are contained neither in the first AVX instruction set nor in the CORE-AVX512 set available with the -axCORE-AVX512 compile option. In any case, it is of course much more expensive to gather data into a register and scatter the results back. Therefore it is strongly recommended

to use arrays within a structure, that is, a structure of arrays (SoA) data layout, rather than a structure as array elements.
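The difference between the two layouts can be made concrete by measuring the byte distance between the y values of two neighbouring particles. This is a sketch under the same 64-bit ABI assumption as before; the names ParticleAoS, aos_y_stride and soa_y_stride are introduced only for this illustration.

```c
#include <stddef.h>

/* AoS element type from the text (renamed here to avoid a clash with the SoA struct). */
struct ParticleAoS {
    char type;
    double x, y;
    float m;
    double vx, vy;
};

/* Byte distance between the y members of two adjacent AoS elements. */
static ptrdiff_t aos_y_stride(void)
{
    struct ParticleAoS a[2];
    return (const char *)&a[1].y - (const char *)&a[0].y;
}

/* Byte distance between two adjacent entries of an SoA y array. */
static ptrdiff_t soa_y_stride(void)
{
    double y[2];
    return (const char *)&y[1] - (const char *)&y[0];
}
```

The AoS stride equals sizeof(struct ParticleAoS), whereas the SoA stride equals sizeof(double), so only the SoA layout places successive y values next to each other in memory.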