ICC 16.0.2: MPX pass creates too-narrow bounds in SSE-heavy code

Hello,

I am using icc (ICC) version 16.0.2 (20160204) on an Intel Skylake CPU. I found a bug in the way its MPX transformation pass creates bounds for SSE-heavy (and heavily optimized) code.

Here is a minimal test case that reproduces the problem (adapted from the Vips program, where the bug was originally triggered):

#define SCALE (1<<6)
float ar[SCALE + 1][SCALE + 1][4];

void __attribute__ ((noinline)) foo() {
    int x, y;
    for( x = 0; x < SCALE + 1; x++ )
        for( y = 0; y < SCALE + 1; y++ ) {
            double X, Y, Xd, Yd;
            double c1, c2, c3, c4;

            X = (double) x / SCALE;
            Y = (double) y / SCALE;
            Xd = 1.0 - X;
            Yd = 1.0 - Y;

            c1 = Xd * Yd;
            c2 = X * Yd;
            c3 = Xd * Y;
            c4 = X * Y;

            ar[x][y][0] = c1;
            ar[x][y][1] = c2;
            ar[x][y][2] = c3;
            ar[x][y][3] = c4;
        }
}

int main() {
    foo();
    return ar[0][0][0];
}

The code raises a #BR exception when built with -O2 and -no-check-pointers-narrowing (on my computer, only this exact combination triggers it):

>>> icc -O2 -ggdb -check-pointers-mpx=rw -no-check-pointers-narrowing -lmpx -lmpxwrappers vipstest.c
>>> ./a.out
Saw a #BR! status 1 at 0x400c26
Saw a #BR! status 1 at 0x400c2e
...

# now with -O1: works correctly
>>> icc -O1 -ggdb -check-pointers-mpx=rw -no-check-pointers-narrowing -lmpx -lmpxwrappers vipstest.c
>>> ./a.out
[ no output ]

# now with -O2 but without -no-check-pointers-narrowing: works correctly
>>> icc -O2 -ggdb -check-pointers-mpx=rw -lmpx -lmpxwrappers vipstest.c
>>> ./a.out
[ no output ]

The offending asm snippet looks like this:

  bndmk  0x13(%rdx),%bnd1  # INCORRECT BOUND: TRIGGERS BR
  bndmk  0x1080f(%rdx),%bnd0  # CORRECT BUT UNUSED BOUND
  ...
  bndcl  0x603904(%rdi),%bnd1
  bndcl  0x603908(%rdi),%bnd1
  bndcl  0x60390c(%rdi),%bnd1
  bndcu  0x603917(%rdi),%bnd1  # TRIGGERS BR
  bndcu  0x60391b(%rdi),%bnd1
  bndcu  0x60391f(%rdi),%bnd1
  ...

Note that when compiled with -O1 (or without -no-check-pointers-narrowing), the asm checks against the correct BND0 register instead of BND1. It looks like some autovectorization (SSE) optimization pass clashes with the MPX instrumentation.
