FWIW, this is another attempt to draw some attention to an unfortunate vectorization-related issue with the Intel C++ compiler. I've tried escalating the problem via Intel Premier Support (issue #6000162527), where it seems to have died (still no technical feedback after 2 months).
I'm using the Intel C++ compiler to develop a highly vectorized graphics application for an AVX512 (KNL) machine. My application makes heavy use of a C++ template abstraction layer which facilitates writing vectorized code using the AoS (Array of Structures) approach.
Unfortunately, the current release (16.0.3) of ICPC generates exceedingly poor code for this approach, which has become a total blocker for my project. The C++ file at the bottom contains a trivial example that demonstrates some of my difficulties.
The most relevant part of the attached file is this function, which sums three streams of 3D vectors and stores the result in a fourth stream:
void arraySum(size_t N, DynamicVector3 &v1, DynamicVector3 &v2,
              DynamicVector3 &v3, DynamicVector3 &v4) {
    // Extract a slice (Float16 * pointer) from each array, add, and store
    for (size_t i = 0; i < N; ++i)
        v1.slice(i) = v2.slice(i) + v3.slice(i) + v4.slice(i);
}
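For reference, here is the semantics of this loop written out in scalar form (a minimal sketch with hypothetical names; the real code performs the same operation per 16-wide __m512 packet instead of per float):

```cpp
#include <cassert>
#include <cstddef>

// Scalar model of arraySum's semantics (hypothetical helper, not part of
// the attached file): every component of each output 3D vector is the sum
// of the corresponding components of the three input streams.
void arraySumScalar(std::size_t N, float *out[3],
                    float *a[3], float *b[3], float *c[3]) {
    for (std::size_t i = 0; i < N; ++i)
        for (int k = 0; k < 3; ++k)
            out[k][i] = a[k][i] + b[k][i] + c[k][i];
}
```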
The file also contains all needed helper types, including a wrapper around __m512 ("Float16"), a class that stores a number of Float16s on the heap ("DynamicArray"), and a Vector3 class that can represent 3D vectors of DynamicArrays, Float16 instances, or pointers to Float16 instances. I've tried to make the example as compact as possible while still demonstrating the poor code generation.
GCC and Clang generate excellent code for this. For instance, with GCC (g++-7 test.cpp -I include -std=c++14 -march=knl -O3 -S -o output-gcc.s) the loop turns into:
L3:
    vmovaps (%r11,%rax), %zmm0
    vmovaps (%r10,%rax), %zmm1
    vaddps  0(%r13,%rax), %zmm0, %zmm0
    vmovaps (%r9,%rax), %zmm2
    vaddps  (%r12,%rax), %zmm1, %zmm1
    vaddps  (%rbx,%rax), %zmm2, %zmm2
    vaddps  (%rsi,%rax), %zmm0, %zmm0
    vaddps  (%rcx,%rax), %zmm1, %zmm1
    vaddps  (%rdx,%rax), %zmm2, %zmm2
    vmovaps %zmm2, (%r14,%rax)
    vmovaps %zmm1, (%r15,%rax)
    vmovaps %zmm0, (%r8,%rax)
    leaq    64(%rax), %rax
    cmpq    %rax, %rdi
    jne     L3
i.e. 9 aligned loads, 6 adds, and 3 aligned stores, plus loop-related instructions. Great! This is what I am expecting to get. For various reasons, I would still like to use the Intel compiler for my project, though.
Here is the output from the Intel compiler for comparison (icpc test.cpp -I include -std=c++14 -xMIC-AVX512 -O3 -S -o output-icpc.s):
                            # LOE rax rdx rcx rbx rsi rdi r8 r9 r14 r15
L_B1.3:                     # Preds L_B1.5 L_B1.2
    movq      (%r9), %r13               #77.32 c1
    movq      (%rsi), %r12              #77.18 c1
    addq      %rax, %r13                #77.18 c5 stall 1
    movq      8(%rsi), %r11             #77.18 c5
    movq      16(%rsi), %r10            #77.18 c5
    addq      %rax, %r12                #77.18 c5
    vmovups   (%r13), %zmm0             #77.46 c9 stall 1
    movq      8(%r9), %r13              #77.32 c9
    addq      %rax, %r11                #77.18 c9
    addq      %rax, %r10                #77.18 c9
    addq      %rax, %r13                #77.18 c13 stall 1
    vmovups   (%r13), %zmm2             #77.46 c15
    movq      16(%r9), %r13             #77.32 c15
    addq      %rax, %r13                #77.18 c19 stall 1
    vmovups   (%r13), %zmm4             #77.46 c21
    movq      (%rcx), %r13              #77.46 c21
    addq      %rax, %r13                #77.18 c25 stall 1
    vaddps    (%r13), %zmm0, %zmm1      #77.46 c27
    vmovups   %zmm1, (%rsp)             #77.46 c33 stall 2
    movq      8(%rcx), %r13             #77.46 c33
    addq      %rax, %r13                #77.18 c37 stall 1
    vaddps    (%r13), %zmm2, %zmm3      #77.46 c39
    vmovups   %zmm3, 64(%rsp)           #77.46 c45 stall 2
    movq      16(%rcx), %r13            #77.46 c45
    addq      %rax, %r13                #77.18 c49 stall 1
    vaddps    (%r13), %zmm4, %zmm5      #77.46 c51
    vmovups   %zmm5, 128(%rsp)          #77.46 c57 stall 2
    vmovups   (%rsp), %zmm6             #77.46 c63 stall 2
    vmovups   %zmm6, 192(%rsp)          #77.46 c69 stall 2
    vmovups   64(%rsp), %zmm7           #77.46 c69
    vmovups   %zmm7, 256(%rsp)          #77.46 c75 stall 2
    vmovups   128(%rsp), %zmm8          #77.46 c75
    vmovups   %zmm8, 320(%rsp)          #77.46 c81 stall 2
                            # LOE rax rdx rcx rbx rsi rdi r8 r9 r10 r11 r12 r14 r15
L_B1.4:                     # Preds L_B1.3
    movq      (%r8), %r13               #77.60 c1
    vmovups   192(%rsp), %zmm0          #77.46 c1
    addq      %rax, %r13                #77.18 c5 stall 1
    vmovups   256(%rsp), %zmm2          #77.60 c5
    vaddps    (%r13), %zmm0, %zmm1      #77.60 c7
    vmovups   %zmm1, 384(%rsp)          #77.60 c13 stall 2
    movq      8(%r8), %r13              #77.60 c13
    addq      %rax, %r13                #77.18 c17 stall 1
    vmovups   320(%rsp), %zmm4          #77.60 c17
    vaddps    (%r13), %zmm2, %zmm3      #77.60 c19
    vmovups   %zmm3, 448(%rsp)          #77.60 c25 stall 2
    movq      16(%r8), %r13             #77.60 c25
    addq      %rax, %r13                #77.18 c29 stall 1
    vaddps    (%r13), %zmm4, %zmm5      #77.60 c31
    vmovups   %zmm5, 512(%rsp)          #77.60 c37 stall 2
    vmovups   384(%rsp), %zmm6          #77.60 c43 stall 2
    vmovups   %zmm6, 576(%rsp)          #77.60 c49 stall 2
    vmovups   448(%rsp), %zmm7          #77.60 c49
    vmovups   %zmm7, 640(%rsp)          #77.60 c55 stall 2
    vmovups   512(%rsp), %zmm8          #77.60 c55
    vmovups   %zmm8, 704(%rsp)          #77.60 c61 stall 2
L_B1.5:                     # Preds L_B1.4
    vmovups   576(%rsp), %zmm0          #77.21 c1
    vmovups   %zmm0, (%r12)             #77.21 c7 stall 2
    vmovups   640(%rsp), %zmm1          #77.21 c7
    vmovups   %zmm1, (%r11)             #77.21 c13 stall 2
    vmovups   704(%rsp), %zmm2          #77.21 c13
    vmovups   %zmm2, (%r10)             #77.21 c19 stall 2
    addq      $1, %rdx                  #76.5 c19
    addq      $64, %rax                 #76.5 c19
    cmpq      %rdi, %rdx                #76.5 c21
    jb        L_B1.3        # Prob 82%  #76.5 c23
                            # LOE rax rdx rcx rbx rsi rdi r8 r9 r14 r15
L_B1.6:                     # Preds L_B1.5
Yikes! Needless to say, this excessive spilling through stack memory eliminates all benefits of vectorization. I have a hard time pinning down the exact root cause, but ICPC spills a large number of intermediate results to the stack even though it doesn't need to: there are plenty of registers available to do all of these computations without ever touching the stack.
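One restructuring that might sidestep the issue in the meantime (my own sketch, not part of the attached file; shown with plain floats for brevity, but the same shape applies to the Float16 packets) is to hoist the nine base pointers out of the loop so that each inner loop is a simple load/add/store over unit-stride streams:

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical workaround sketch: loop over the three components in the
// outer loop and fetch each stream's base pointer exactly once, so the
// compiler never has to re-load pointers inside the hot loop.
void arraySumHoisted(std::size_t N, float *out[3],
                     float *a[3], float *b[3], float *c[3]) {
    for (int k = 0; k < 3; ++k) {
        float *o = out[k], *x = a[k], *y = b[k], *z = c[k];
        for (std::size_t i = 0; i < N; ++i)
            o[i] = x[i] + y[i] + z[i];
    }
}
```

Of course, the whole point of the abstraction layer is that I shouldn't have to write the loops this way by hand.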
I'm quite desperate at this point and hope that this issue can be resolved somehow. I'd be happy to provide any kind of additional details if this would be helpful.
Thank you in advance,
Wenzel Jakob
This is the full program which reproduces the issue:
// Simple toy program which adds three streams of 3D vectors and writes to a
// fourth stream
//
// Compiled with:
//
// Intel compiler:
// $ icpc test.cpp -I include -std=c++14 -xMIC-AVX512 -O3 -S -o output-icpc.s
//
// GCC:
// $ g++-7 test.cpp -I include -std=c++14 -march=knl -O3 -S -o output-gcc.s

#include <cstring>
#include <cstdlib>
#include <cstddef>
#include <functional>
#include <type_traits>
#include <immintrin.h>

/// Wrapper around an AVX512 float vector (16x)
struct alignas(64) Float16 {
    __m512 value;

    /// Add two Float16 vectors
    Float16 operator+(Float16 f) const {
        return Float16{_mm512_add_ps(value, f.value)};
    }
};

/// List of static arrays which are stored on the heap
struct DynamicArray {
    Float16 *values;

    /// Get one "packet" by reference
    Float16 *packet(size_t i) { return values + i; }
};

/// Array of Structures style 3D vector, can be templated with
/// T=Float16, T=Float16* or T=DynamicArray
template <typename T> struct Vector3 {
public:
    /// Initialize with component values
    template <typename... Args>
    Vector3(Args &&... args) : values{ std::forward<Args>(args)... } {}

    /// Access component 'i' of the vector (normal path)
    template <typename Type = T,
              std::enable_if_t<!std::is_pointer<Type>::value, int> = 0>
    auto& coeff(size_t i) { return values[i]; }

    /// Ditto, const version
    template <typename Type = T,
              std::enable_if_t<!std::is_pointer<Type>::value, int> = 0>
    const auto& coeff(size_t i) const { return values[i]; }

    /// Access component 'i' of the vector (if it is a pointer, dereference it)
    template <typename Type = T,
              std::enable_if_t<std::is_pointer<Type>::value, int> = 0>
    auto& coeff(size_t i) { return *(values[i]); }

    /// Ditto, const version
    template <typename Type = T,
              std::enable_if_t<std::is_pointer<Type>::value, int> = 0>
    const auto& coeff(size_t i) const { return *(values[i]); }

    /// Assign component values of another 3D vector to this vector
    template <typename T2> Vector3& operator=(Vector3<T2> v2) {
        for (int i = 0; i < 3; ++i)
            coeff(i) = v2.coeff(i);
        return *this;
    }

    /// Get a slice of pointers to static arrays (at index i).
    /// Assumes that 'T' is a DynamicArray
    auto slice(size_t i) {
        return Vector3<Float16 *>(
            values[0].packet(i), values[1].packet(i), values[2].packet(i));
    }

    /// Add two 3D vectors and return the result
    template <typename T2> auto operator+(const Vector3<T2> &v2) {
        return Vector3<decltype(coeff(0) + v2.coeff(0))>(
            coeff(0) + v2.coeff(0),
            coeff(1) + v2.coeff(1),
            coeff(2) + v2.coeff(2));
    }

    T values[3];
};

using DynamicVector3 = Vector3<DynamicArray>;

void arraySum(size_t N, DynamicVector3 &v1, DynamicVector3 &v2,
              DynamicVector3 &v3, DynamicVector3 &v4) {
    // Extract a slice (Float16 * pointer) from each array, add, and store
    for (size_t i = 0; i < N; ++i)
        v1.slice(i) = v2.slice(i) + v3.slice(i) + v4.slice(i);
}