FWIW, this is another attempt to draw some attention to an unfortunate vectorization-related issue with the Intel C++ compiler. I've tried escalating the problem via Intel Premier Support (issue #6000162527), where it seems to have died (still no technical feedback after 2 months).
I'm using the Intel C++ compiler to develop a highly vectorized graphics application for an AVX512 (KNL) machine. My application makes heavy use of a C++ template abstraction layer which facilitates writing vectorized code using the AoS (Array of Structures) approach.
Unfortunately, the current release (16.0.3) of ICPC generates exceedingly poor code for this approach, which has become a total blocker for my project. The C++ file at the bottom contains a trivial example that demonstrates some of my difficulties.
The most relevant part of the attached file is this function, which sums three streams of 3D vectors and stores the result in a fourth stream:
void arraySum(size_t N, DynamicVector3 &v1, DynamicVector3 &v2,
              DynamicVector3 &v3, DynamicVector3 &v4) {
    // Extract a slice (Float16 * pointer) from each array, add, and store
    for (size_t i = 0; i < N; ++i)
        v1.slice(i) = v2.slice(i) + v3.slice(i) + v4.slice(i);
}
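For reference, here is the semantics of this loop written out in scalar form (a minimal sketch with hypothetical names; the real code performs the same operation per 16-wide __m512 packet instead of per float):

```cpp
#include <cassert>
#include <cstddef>

// Scalar model of arraySum's semantics (hypothetical helper, not part of
// the attached file): every component of each output 3D vector is the sum
// of the corresponding components of the three input streams.
void arraySumScalar(std::size_t N, float *out[3],
                    float *a[3], float *b[3], float *c[3]) {
    for (std::size_t i = 0; i < N; ++i)
        for (int k = 0; k < 3; ++k)
            out[k][i] = a[k][i] + b[k][i] + c[k][i];
}
```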
The file also contains all needed helper types, including a wrapper around __m512 ("Float16"), a class that stores a number of Float16s on the heap ("DynamicArray"), and a Vector3 class that can represent 3D vectors of DynamicArrays, Float16 instances, or pointers to Float16 instances. I've tried to make the example as compact as possible while still demonstrating the poor code generation.
GCC and Clang generate excellent code for this. For instance, with GCC (g++-7 test.cpp -I include -std=c++14 -march=knl -O3 -S -o output-gcc.s) the loop turns into:
L3:
    vmovaps (%r11,%rax), %zmm0
    vmovaps (%r10,%rax), %zmm1
    vaddps  0(%r13,%rax), %zmm0, %zmm0
    vmovaps (%r9,%rax), %zmm2
    vaddps  (%r12,%rax), %zmm1, %zmm1
    vaddps  (%rbx,%rax), %zmm2, %zmm2
    vaddps  (%rsi,%rax), %zmm0, %zmm0
    vaddps  (%rcx,%rax), %zmm1, %zmm1
    vaddps  (%rdx,%rax), %zmm2, %zmm2
    vmovaps %zmm2, (%r14,%rax)
    vmovaps %zmm1, (%r15,%rax)
    vmovaps %zmm0, (%r8,%rax)
    leaq    64(%rax), %rax
    cmpq    %rax, %rdi
    jne     L3
i.e. 9 aligned loads, 6 adds, and 3 aligned stores, plus loop-related instructions. Great! This is what I am expecting to get. For various reasons, I would still like to use the Intel compiler for my project, though.
Here is the output from the Intel compiler for comparison (icpc test.cpp -I include -std=c++14 -xMIC-AVX512 -O3 -S -o output-icpc.s):
                            # LOE rax rdx rcx rbx rsi rdi r8 r9 r14 r15
L_B1.3:                     # Preds L_B1.5 L_B1.2
    movq      (%r9), %r13               #77.32 c1
    movq      (%rsi), %r12              #77.18 c1
    addq      %rax, %r13                #77.18 c5 stall 1
    movq      8(%rsi), %r11             #77.18 c5
    movq      16(%rsi), %r10            #77.18 c5
    addq      %rax, %r12                #77.18 c5
    vmovups   (%r13), %zmm0             #77.46 c9 stall 1
    movq      8(%r9), %r13              #77.32 c9
    addq      %rax, %r11                #77.18 c9
    addq      %rax, %r10                #77.18 c9
    addq      %rax, %r13                #77.18 c13 stall 1
    vmovups   (%r13), %zmm2             #77.46 c15
    movq      16(%r9), %r13             #77.32 c15
    addq      %rax, %r13                #77.18 c19 stall 1
    vmovups   (%r13), %zmm4             #77.46 c21
    movq      (%rcx), %r13              #77.46 c21
    addq      %rax, %r13                #77.18 c25 stall 1
    vaddps    (%r13), %zmm0, %zmm1      #77.46 c27
    vmovups   %zmm1, (%rsp)             #77.46 c33 stall 2
    movq      8(%rcx), %r13             #77.46 c33
    addq      %rax, %r13                #77.18 c37 stall 1
    vaddps    (%r13), %zmm2, %zmm3      #77.46 c39
    vmovups   %zmm3, 64(%rsp)           #77.46 c45 stall 2
    movq      16(%rcx), %r13            #77.46 c45
    addq      %rax, %r13                #77.18 c49 stall 1
    vaddps    (%r13), %zmm4, %zmm5      #77.46 c51
    vmovups   %zmm5, 128(%rsp)          #77.46 c57 stall 2
    vmovups   (%rsp), %zmm6             #77.46 c63 stall 2
    vmovups   %zmm6, 192(%rsp)          #77.46 c69 stall 2
    vmovups   64(%rsp), %zmm7           #77.46 c69
    vmovups   %zmm7, 256(%rsp)          #77.46 c75 stall 2
    vmovups   128(%rsp), %zmm8          #77.46 c75
    vmovups   %zmm8, 320(%rsp)          #77.46 c81 stall 2
                            # LOE rax rdx rcx rbx rsi rdi r8 r9 r10 r11 r12 r14 r15
L_B1.4:                     # Preds L_B1.3
    movq      (%r8), %r13               #77.60 c1
    vmovups   192(%rsp), %zmm0          #77.46 c1
    addq      %rax, %r13                #77.18 c5 stall 1
    vmovups   256(%rsp), %zmm2          #77.60 c5
    vaddps    (%r13), %zmm0, %zmm1      #77.60 c7
    vmovups   %zmm1, 384(%rsp)          #77.60 c13 stall 2
    movq      8(%r8), %r13              #77.60 c13
    addq      %rax, %r13                #77.18 c17 stall 1
    vmovups   320(%rsp), %zmm4          #77.60 c17
    vaddps    (%r13), %zmm2, %zmm3      #77.60 c19
    vmovups   %zmm3, 448(%rsp)          #77.60 c25 stall 2
    movq      16(%r8), %r13             #77.60 c25
    addq      %rax, %r13                #77.18 c29 stall 1
    vaddps    (%r13), %zmm4, %zmm5      #77.60 c31
    vmovups   %zmm5, 512(%rsp)          #77.60 c37 stall 2
    vmovups   384(%rsp), %zmm6          #77.60 c43 stall 2
    vmovups   %zmm6, 576(%rsp)          #77.60 c49 stall 2
    vmovups   448(%rsp), %zmm7          #77.60 c49
    vmovups   %zmm7, 640(%rsp)          #77.60 c55 stall 2
    vmovups   512(%rsp), %zmm8          #77.60 c55
    vmovups   %zmm8, 704(%rsp)          #77.60 c61 stall 2
L_B1.5:                     # Preds L_B1.4
    vmovups   576(%rsp), %zmm0          #77.21 c1
    vmovups   %zmm0, (%r12)             #77.21 c7 stall 2
    vmovups   640(%rsp), %zmm1          #77.21 c7
    vmovups   %zmm1, (%r11)             #77.21 c13 stall 2
    vmovups   704(%rsp), %zmm2          #77.21 c13
    vmovups   %zmm2, (%r10)             #77.21 c19 stall 2
    addq      $1, %rdx                  #76.5 c19
    addq      $64, %rax                 #76.5 c19
    cmpq      %rdi, %rdx                #76.5 c21
    jb        L_B1.3        # Prob 82%  #76.5 c23
                            # LOE rax rdx rcx rbx rsi rdi r8 r9 r14 r15
L_B1.6:                     # Preds L_B1.5
Yikes! Needless to say, this excessive spilling through stack memory eliminates all benefits of vectorization. I have a hard time pinning down the exact root cause, but ICPC spills a large number of intermediate results to the stack even though it doesn't need to: there are plenty of registers available to do all of these computations without ever touching the stack.
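One restructuring that might sidestep the issue in the meantime (my own sketch, not part of the attached file; shown with plain floats for brevity, but the same shape applies to the Float16 packets) is to hoist the nine base pointers out of the loop so that each inner loop is a simple load/add/store over unit-stride streams:

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical workaround sketch: loop over the three components in the
// outer loop and fetch each stream's base pointer exactly once, so the
// compiler never has to re-load pointers inside the hot loop.
void arraySumHoisted(std::size_t N, float *out[3],
                     float *a[3], float *b[3], float *c[3]) {
    for (int k = 0; k < 3; ++k) {
        float *o = out[k], *x = a[k], *y = b[k], *z = c[k];
        for (std::size_t i = 0; i < N; ++i)
            o[i] = x[i] + y[i] + z[i];
    }
}
```

Of course, the whole point of the abstraction layer is that I shouldn't have to write the loops this way by hand.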
I'm quite desperate at this point and hope that this issue can be resolved somehow. I'd be happy to provide any kind of additional details if this would be helpful.
Thank you in advance,
Wenzel Jakob
This is the full program which reproduces the issue:
// Simple toy program which adds three streams of 3D vectors and writes to a
// fourth stream
//
// Compiled with:
//
// Intel compiler:
// $ icpc test.cpp -I include -std=c++14 -xMIC-AVX512 -O3 -S -o output-icpc.s
//
// GCC:
// $ g++-7 test.cpp -I include -std=c++14 -march=knl -O3 -S -o output-gcc.s

#include <cstring>
#include <cstdlib>
#include <cstddef>
#include <functional>
#include <type_traits>
#include <immintrin.h>

/// Wrapper around an AVX512 float vector (16x)
struct alignas(64) Float16 {
    __m512 value;

    /// Add two Float16 vectors
    Float16 operator+(Float16 f) const {
        return Float16{_mm512_add_ps(value, f.value)};
    }
};

/// List of static arrays which are stored on the heap
struct DynamicArray {
    Float16 *values;

    /// Get one "packet" by reference
    Float16 *packet(size_t i) { return values + i; }
};

/// Array of Structures style 3D vector, can be templated with
/// T=Float16, T=Float16* or T=DynamicArray
template <typename T> struct Vector3 {
public:
    /// Initialize with component values
    template <typename... Args>
    Vector3(Args &&... args) : values{ std::forward<Args>(args)... } {}

    /// Access component 'i' of the vector (normal path)
    template <typename Type = T,
              std::enable_if_t<!std::is_pointer<Type>::value, int> = 0>
    auto& coeff(size_t i) { return values[i]; }

    /// Ditto, const version
    template <typename Type = T,
              std::enable_if_t<!std::is_pointer<Type>::value, int> = 0>
    const auto& coeff(size_t i) const { return values[i]; }

    /// Access component 'i' of the vector (if it is a pointer, dereference it)
    template <typename Type = T,
              std::enable_if_t<std::is_pointer<Type>::value, int> = 0>
    auto& coeff(size_t i) { return *(values[i]); }

    /// Ditto, const version
    template <typename Type = T,
              std::enable_if_t<std::is_pointer<Type>::value, int> = 0>
    const auto& coeff(size_t i) const { return *(values[i]); }

    /// Assign component values of another 3D vector to this vector
    template <typename T2> Vector3& operator=(Vector3<T2> v2) {
        for (int i = 0; i < 3; ++i)
            coeff(i) = v2.coeff(i);
        return *this;
    }

    /// Get a slice of pointers to static arrays (at index i).
    /// Assumes that 'T' is a DynamicArray
    auto slice(size_t i) {
        return Vector3<Float16 *>(
            values[0].packet(i), values[1].packet(i), values[2].packet(i));
    }

    /// Add two 3D vectors and return the result
    template <typename T2> auto operator+(const Vector3<T2> &v2) {
        return Vector3<decltype(coeff(0) + v2.coeff(0))>(
            coeff(0) + v2.coeff(0),
            coeff(1) + v2.coeff(1),
            coeff(2) + v2.coeff(2));
    }

    T values[3];
};

using DynamicVector3 = Vector3<DynamicArray>;

void arraySum(size_t N, DynamicVector3 &v1, DynamicVector3 &v2,
              DynamicVector3 &v3, DynamicVector3 &v4) {
    // Extract a slice (Float16 * pointer) from each array, add, and store
    for (size_t i = 0; i < N; ++i)
        v1.slice(i) = v2.slice(i) + v3.slice(i) + v4.slice(i);
}