I'm running into a very annoying compiler optimization issue that's causing crashes on older systems (older CPUs and operating systems). The example below is simplified to demonstrate the issue:
    switch (codepath) {
        case AVX: {
            // Each case gets its own scope so the three `bla` declarations don't clash.
            __m256 bla = _mm256_setzero_ps();
            *x = bla;
            break;
        }
        case SSE2: {
            __m128 bla = _mm_setzero_ps();
            *x = bla;
            break;
        }
        default: {
            float bla = 0;
            *x = bla;
            break;
        }
    }
The problem is that, for some reason, the compiler "thinks" it's a good idea to move certain instructions outside of the switch() statement. So, in the generated assembly, the instruction for _mm256_setzero_ps() can execute before the CPU type is checked. I'm guessing the compiler considers this valid for both the AVX and SSE2 cases because the XMM register is the lower half of the YMM register. Also, in the SSE2 path, probably because I'm using AVX code in the same function, the SSE move instructions (e.g. movaps) are replaced by their VEX-encoded forms (e.g. vmovaps), which also fault on a system without AVX support.
What I want:
- 1 executable that targets multiple instruction sets
- No "Genuine Intel" checks. This needs to work on AMD as well.
- Easy to use: if possible, I don't want to write separate functions for separate targets (I need this in dozens of places in my code)
- For debugging purposes, a way to dynamically choose other code paths at runtime
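To make the last three requirements concrete, here is a minimal sketch of what the selection side could look like. It assumes GCC or Clang on x86 (it relies on their `__builtin_cpu_supports` builtin, which tests CPU features rather than the vendor string); on any other compiler or architecture it simply falls back to the scalar path. The names `CodePath`, `detect_codepath`, `parse_codepath` and `select_codepath` are made up for this illustration. Note that it only covers choosing a path; it does nothing about the compiler moving instructions out of the switch.

```cpp
#include <cassert>
#include <cstring>

// Hypothetical names, for illustration only. Ordered from least to most capable.
enum class CodePath { Scalar = 0, SSE2 = 1, AVX = 2 };

// Vendor-neutral detection: asks about features, never about the CPU brand.
// Assumption: GCC or Clang on x86. Anything else reports Scalar.
inline CodePath detect_codepath() {
#if (defined(__GNUC__) || defined(__clang__)) && (defined(__x86_64__) || defined(__i386__))
    __builtin_cpu_init();
    if (__builtin_cpu_supports("avx"))  return CodePath::AVX;
    if (__builtin_cpu_supports("sse2")) return CodePath::SSE2;
#endif
    return CodePath::Scalar;
}

// Maps a user-supplied name to a code path; unknown or missing names
// return the given fallback.
inline CodePath parse_codepath(const char* name, CodePath fallback) {
    if (name == nullptr) return fallback;
    if (std::strcmp(name, "avx") == 0)    return CodePath::AVX;
    if (std::strcmp(name, "sse2") == 0)   return CodePath::SSE2;
    if (std::strcmp(name, "scalar") == 0) return CodePath::Scalar;
    return fallback;
}

// Debug override: the caller may request a lower path than the detected one,
// but never a higher one, so an override cannot select unsupported instructions.
inline CodePath select_codepath(const char* override_name) {
    const CodePath best   = detect_codepath();
    const CodePath wanted = parse_codepath(override_name, best);
    const CodePath chosen = (wanted < best) ? wanted : best;
    assert(chosen <= best);
    return chosen;
}
```

The override string could come from anywhere, for example `select_codepath(std::getenv("MYAPP_CODEPATH"))`, where `MYAPP_CODEPATH` is a hypothetical environment variable name. Clamping to the detected maximum is a design choice for this sketch: it keeps the debugging switch useful for testing the SSE2 and scalar paths on an AVX machine without letting it cause a crash on an older one.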
In debug mode everything works as expected, and in release mode, it appears to work fine for 42 of the 44 places where I'm doing this. But those other two are causing crashes, and I obviously don't want to have code that might break on each new compiler version.
The compiler's automatic dispatch behavior doesn't really help: it doesn't appear to work on AMD, and I can't override its choice when I run the binary.
Using #if __AVX__, as was suggested elsewhere in this forum, doesn't work either: the #if is evaluated once at compile time for the entire build, not per code path. It evaluates to 0 in my build, it cannot be changed at runtime, and having multiple code paths selected this way probably won't work well on AMD either.
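For comparison, this is roughly what the separate-functions arrangement I'd like to avoid looks like. It is only a sketch under assumptions: it uses the GCC/Clang `target` function attribute (I don't know whether every compiler offers an equivalent), and the names `zero_avx`, `zero_sse2`, `zero_scalar` and `zero8` are invented for the example. As far as I understand, because each path is compiled as its own function with its own instruction set, the AVX instructions stay behind the runtime check and the SSE2 path is not VEX-encoded.

```cpp
#include <cassert>

#if (defined(__GNUC__) || defined(__clang__)) && (defined(__x86_64__) || defined(__i386__))
#include <immintrin.h>
#define HAVE_X86_TARGETS 1

// Compiled with AVX enabled for this function only.
// noinline is extra insurance against it being merged into the dispatcher.
__attribute__((target("avx"), noinline))
static void zero_avx(float* x) {
    _mm256_storeu_ps(x, _mm256_setzero_ps());
}

// Compiled with SSE2 only, so the stores use the legacy (non-VEX) encoding.
__attribute__((target("sse2"), noinline))
static void zero_sse2(float* x) {
    _mm_storeu_ps(x,     _mm_setzero_ps());
    _mm_storeu_ps(x + 4, _mm_setzero_ps());
}
#endif

// Portable fallback, no intrinsics.
static void zero_scalar(float* x) {
    for (int i = 0; i < 8; ++i) x[i] = 0.0f;
}

// Dispatcher: clears 8 floats using the best path the CPU reports.
// It contains no intrinsics itself, so there is nothing to hoist above the check.
void zero8(float* x) {
    assert(x != nullptr);
#ifdef HAVE_X86_TARGETS
    __builtin_cpu_init();
    if (__builtin_cpu_supports("avx"))  { zero_avx(x);  return; }
    if (__builtin_cpu_supports("sse2")) { zero_sse2(x); return; }
#endif
    zero_scalar(x);
}
```

This satisfies the "one executable" and "works on AMD" points, but it means three small functions plus a dispatcher for every one of the 44 places, which is the boilerplate I was hoping the switch() approach would save me.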