There seems to be a bug in Intel2017. The following code was compiled with "icc -std=c99 -O3 *.c" on an Ivy Bridge processor:
    // File: accumulate.c
    void accumulate(int * offsets,
                    double const * const restrict input,
                    double * const restrict output)
    {
        output[0] += input[0]; // first offset is always zero
        for(int i = 1; i < 4; i++)
            output[offsets[i]] += input[i];
    }
    // File: main.c
    #include <stdio.h>

    void accumulate(int * offsets,
                    double const * const restrict input,
                    double * const restrict output);

    int main(void)
    {
        int offsets[4] = {0, 0, 1, 1};
        double input[4] = {1.0, 2.0, 3.0, 4.0};
        double output[4] = {0.0, 0.0, 0.0, 0.0};

        accumulate(offsets, input, output);

        printf("Results: %12.6e %12.6e %12.6e %12.6e\n",
               output[0], output[1], output[2], output[3]);
        return 0;
    }
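The resulting output is "1.000000e+00 7.000000e+00 0.000000e+00 0.000000e+00". The first value is incorrect: it should be 3.000000e+00, because offsets[1] is 0, so both input[0] and input[1] are added to output[0]. The output is correct with Intel2016, or if I remove the restrict keywords.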
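To make the expected values easy to check, here is a minimal sketch of a baseline without the restrict qualifiers. The names accumulate_reference and outputs_match are hypothetical helpers, not part of the original test case; the arithmetic is the same as in accumulate().

```c
#include <stddef.h>

/* Hypothetical baseline: same arithmetic as accumulate(), but with no
   restrict qualifiers, so the compiler has to assume possible aliasing. */
void accumulate_reference(const int *offsets,
                          const double *input,
                          double *output)
{
    output[0] += input[0];              /* first offset is always zero */
    for (int i = 1; i < 4; i++)
        output[offsets[i]] += input[i];
}

/* Hypothetical helper: returns 1 if the first n values of a and b are
   identical, 0 otherwise. Exact comparison is fine here because the
   sums involved are small integers. */
int outputs_match(const double *a, const double *b, size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (a[i] != b[i])
            return 0;
    return 1;
}
```

Calling accumulate_reference() with the arrays from main.c should give 3.0, 7.0, 0.0, 0.0, and outputs_match() can then compare that against what accumulate() produced.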
Looking at the disassembled code for accumulate(), the compiler doesn't seem to realise that output[0] and output[offsets[...]] can be the same memory location. It loads both values before storing either one, and then stores the output[0] result last, so that store overwrites the value already written through output[offsets[1]] instead of the output[0] update being completed first.
    movsxd rax,DWORD PTR [rdi+0x4]       <---- rax = 0 in this case
    movsxd rcx,DWORD PTR [rdi+0x8]
    movsxd r8,DWORD PTR [rdi+0xc]
    movsd  xmm1,QWORD PTR [rdx+rax*8]    <----
    movsd  xmm0,QWORD PTR [rdx]          <---- [rdx] and [rdx+rax*8] are the same!
    addsd  xmm1,QWORD PTR [rsi+0x8]
    addsd  xmm0,QWORD PTR [rsi]
    movsd  QWORD PTR [rdx+rax*8],xmm1    <---- puts results in output[0]
    movsd  xmm2,QWORD PTR [rdx+rcx*8]
    movsd  QWORD PTR [rdx],xmm0          <---- overwrites [rdx+rax*8]
    addsd  xmm2,QWORD PTR [rsi+0x10]
    movsd  QWORD PTR [rdx+rcx*8],xmm2
    movsd  xmm3,QWORD PTR [rdx+r8*8]
    addsd  xmm3,QWORD PTR [rsi+0x18]
    movsd  QWORD PTR [rdx+r8*8],xmm3
    ret
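The effect of that instruction order can be reproduced in plain C. The following is only a hand-written model of the listing above, not compiler output: the function name accumulate_as_scheduled is hypothetical, and the local variables are named after the registers they stand in for (rdi = offsets, rsi = input, rdx = output under the x86-64 System V calling convention).

```c
/* Hypothetical model of the scheduled instruction order shown in the
   disassembly. Each statement mirrors one load, add, or store. */
void accumulate_as_scheduled(const int *offsets,
                             const double *input,
                             double *output)
{
    double xmm1 = output[offsets[1]];   /* load output[offsets[1]]        */
    double xmm0 = output[0];            /* load output[0] (may alias it)  */
    xmm1 += input[1];
    xmm0 += input[0];
    output[offsets[1]] = xmm1;          /* store for i = 1                */

    double xmm2 = output[offsets[2]];
    output[0] = xmm0;                   /* late store: clobbers the i = 1
                                           result when offsets[1] == 0    */
    xmm2 += input[2];
    output[offsets[2]] = xmm2;

    double xmm3 = output[offsets[3]];
    xmm3 += input[3];
    output[offsets[3]] = xmm3;
}
```

With the arrays from main.c, this model ends with output[0] == 1.0 and output[1] == 7.0, matching the incorrect result reported above. When offsets[1] is nonzero the two locations are distinct and the order does not matter.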