Hello,
I'm not 100% sure this is the right place, but the problem seems related to running executables compiled with the Intel compiler.
(Otherwise, feel free to redirect me to the proper place.)
Here is my problem:
I'm using the Intel C++ compiler 2018.2 (Linux version) to compile a tool that processes scientific data; it is heavily multithreaded and uses the MKL libraries.
When I run it on my most recent PC, it crashes the machine completely (it ends up in an undefined state and has to be hard reset).
Crashes are random but occur quite rapidly. I tried compiling with different options without any luck. On reboot (after all the hassle of fsck) I got messages like the ones below; each excerpt is preceded by the compiler command line used for that build:
icpc -O3 -Wuninitialized -funroll-loops -unroll-aggressive -restrict -wd3802 -xHOST
mai 29 14:20:32 erichthonios kernel: [Firmware Bug]: TSC ADJUST differs within socket(s), fixing all errors
mai 29 14:20:32 erichthonios kernel: #2 #3 #4 #5 #6 #7 #8 #9 #10 #11 #12 #13 #14
mai 29 14:20:32 erichthonios kernel: mce: [Hardware Error]: Machine check events logged
mai 29 14:20:32 erichthonios kernel: mce: [Hardware Error]: CPU 14: Machine Check: 0 Bank 0: f200000000000005
mai 29 14:20:32 erichthonios kernel: #15
mai 29 14:20:32 erichthonios kernel: mce: [Hardware Error]: TSC 0
mai 29 14:20:32 erichthonios kernel: #16
mai 29 14:20:32 erichthonios kernel: mce: [Hardware Error]: PROCESSOR 0:50654 TIME 1527596388 SOCKET 0 APIC 30 microcode 2000043
mai 29 14:20:32 erichthonios kernel: #17 #18 #19 #20 #21 #22 #23 #24 #25 #26 #27 #28 #29 #30 #31 #32 #33 #34 #35
icpc -O3 -Wuninitialized -funroll-loops -unroll-aggressive -restrict -wd3802 -xAVX
mai 29 14:32:43 erichthonios kernel: [Firmware Bug]: TSC ADJUST differs within socket(s), fixing all errors
mai 29 14:32:43 erichthonios kernel: #2 #3 #4 #5 #6 #7 #8 #9 #10 #11 #12 #13 #14
mai 29 14:32:43 erichthonios kernel: mce: [Hardware Error]: Machine check events logged
mai 29 14:32:43 erichthonios kernel: mce: [Hardware Error]: CPU 14: Machine Check: 0 Bank 0: b200000000070005
mai 29 14:32:43 erichthonios kernel: #15
mai 29 14:32:43 erichthonios kernel: mce: [Hardware Error]: TSC 0
mai 29 14:32:43 erichthonios kernel: #16
mai 29 14:32:43 erichthonios kernel: mce: [Hardware Error]: PROCESSOR 0:50654 TIME 1527597117 SOCKET 0 APIC 30 microcode 2000043
mai 29 14:32:43 erichthonios kernel: #17 #18 #19 #20 #21 #22 #23 #24 #25 #26 #27 #28 #29 #30 #31 #32 #33 #34 #35
icpc -O3 -Wuninitialized -funroll-loops -unroll-aggressive -restrict -wd3802
mai 29 14:43:49 erichthonios kernel: mce: [Hardware Error]: Machine check events logged
mai 29 14:43:49 erichthonios kernel: mce: [Hardware Error]: CPU 2: Machine Check: 0 Bank 0: f200000000000005
mai 29 14:43:49 erichthonios kernel: #3
mai 29 14:43:49 erichthonios kernel: mce: [Hardware Error]: TSC 0
mai 29 14:43:49 erichthonios kernel: #4
mai 29 14:43:49 erichthonios kernel: mce: [Hardware Error]: PROCESSOR 0:50654 TIME 1527597785 SOCKET 0 APIC 4 microcode 2000043
icpc -restrict -wd3802
This last excerpt is from dmesg, sorry...
[ 0.003333] [FirmwareBug]: TSC ADJUST differs within socket(s), fixing all errors
[ 0.150022] #2 #3 #4
[ 0.223347] mce: [Hardware Error]: Machine check events logged
[ 0.223353] mce: [Hardware Error]: CPU 4: Machine Check: 0 Bank 0: f200000000000005
[ 0.250024] #5
[ 0.253338] mce: [Hardware Error]: TSC 0
[ 0.286687] #6
[ 0.290007] mce: [Hardware Error]: PROCESSOR 0:50654 TIME 1527676563 SOCKET 0 APIC 8 microcode 2000043
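For anyone reading the logs: the 64-bit value after "Bank 0:" is the machine-check status register for that bank. Below is a minimal Python sketch that splits it into its flag bits and error codes. The bit positions and the short table of "simple" error codes are written from my reading of the Intel SDM machine-check chapter and should be treated as assumptions to verify there; the function and table names are made up for this example.

```python
# Sketch: decode an IA32_MCi_STATUS value as printed in the
# "Machine Check: 0 Bank 0: <hex>" lines above.
# Bit layout is an assumption taken from the Intel SDM; verify before relying on it.

# Architectural flag bits (bit 63 is the most significant bit).
FLAG_BITS = {
    "VAL": 63,    # register holds valid error information
    "OVER": 62,   # an earlier error was overwritten
    "UC": 61,     # uncorrected error
    "EN": 60,     # error reporting was enabled
    "MISCV": 59,  # the MISC register holds extra information
    "ADDRV": 58,  # the ADDR register holds a valid address
    "PCC": 57,    # processor context corrupt
}

# A few of the "simple" MCA error codes; anything else is reported as unknown.
SIMPLE_MCA_CODES = {
    0x0000: "no error",
    0x0001: "unclassified",
    0x0002: "microcode ROM parity error",
    0x0003: "external error",
    0x0004: "FRC error",
    0x0005: "internal parity error",
}


def decode_mci_status(status_hex):
    """Return the set flags and the two 16-bit error codes of a status word."""
    status = int(status_hex, 16)
    flags = [name for name, bit in FLAG_BITS.items() if (status >> bit) & 1]
    mca_code = status & 0xFFFF              # bits 15:0, architectural error code
    model_code = (status >> 16) & 0xFFFF    # bits 31:16, model-specific code
    return {
        "flags": flags,
        "mca_code": mca_code,
        "mca_meaning": SIMPLE_MCA_CODES.get(mca_code, "unknown / compound code"),
        "model_specific_code": model_code,
    }


if __name__ == "__main__":
    for word in ("f200000000000005", "b200000000070005"):
        print(word, decode_mci_status(word))
```

Under these assumptions, every status word in the logs above is flagged as valid, uncorrected and context-corrupting, with the same architectural error code 0x0005; only the overflow flag and the model-specific code differ between them.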
More troubling is that the same code compiled with g++ -O3 works perfectly, and normal daily usage (python, mail, etc.) is also stable.
My hardware:
- MB: TUF X299 Mark 1 Bios 1301
- RAM: 128GB
- CPU: Intel(R) Core(TM) i9-7980XE CPU @ 2.60GHz
- Video: nvidia GTX1060
- Memtest OK for a whole night (5 passes)
- Running on 12 cores. Temperature around 68C.
OS: Archlinux
kernel: Linux version 4.16.11-1-ARCH (builduser@heftig-1505) (gcc version 8.1.0 (GCC)) #1 SMP PREEMPT Tue May 22 21:40:27 UTC 2018
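As a cross-check that the mce lines refer to this CPU, the "PROCESSOR 0:50654" field can be decoded. The sketch below assumes the value after the colon is the CPUID signature (EAX of CPUID leaf 1) in hex and that the TIME field is a Unix timestamp in seconds; both are my reading of the kernel output rather than something stated in the log itself, and the function names are only illustrative.

```python
# Sketch: decode the "PROCESSOR <vendor>:<cpuid>" and "TIME" fields of an mce line.
# Assumes <cpuid> is the CPUID leaf 1 EAX signature and TIME is Unix seconds.
from datetime import datetime, timezone


def decode_cpuid_signature(sig_hex):
    """Return (family, model, stepping) from a CPUID signature given in hex."""
    eax = int(sig_hex, 16)
    stepping = eax & 0xF
    base_model = (eax >> 4) & 0xF
    base_family = (eax >> 8) & 0xF
    ext_model = (eax >> 16) & 0xF
    ext_family = (eax >> 20) & 0xFF

    # The extended family is only added when the base family is 0xF.
    family = base_family + ext_family if base_family == 0xF else base_family
    # The extended model is only used for base family 0x6 or 0xF.
    if base_family in (0x6, 0xF):
        model = (ext_model << 4) | base_model
    else:
        model = base_model
    return family, model, stepping


def decode_mce_time(seconds):
    """Convert the TIME field to an ISO 8601 string in UTC."""
    return datetime.fromtimestamp(seconds, tz=timezone.utc).isoformat()


if __name__ == "__main__":
    family, model, stepping = decode_cpuid_signature("50654")
    print(f"family {family}, model {model} (0x{model:x}), stepping {stepping}")
    print(decode_mce_time(1527596388))
```

Run on the first excerpt, this gives family 6, model 85 (0x55), stepping 4, and a timestamp of 12:19:48 UTC on 29 May 2018, i.e. shortly before the 14:20:32 local-time boot messages, which suggests the event was recorded at crash time and only reported on the next boot.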
On older hardware (i7-3930K CPU @ 3.20GHz; 16GB RAM; Asus P9X79), apart from overheating (89C), everything works fine using exactly the same executable under the same conditions.
I tried googling this but didn't find any helpful answer. Could it be that the Intel compiler is producing code which is for some reason (e.g. this TSC issue) incompatible with the i9-7980XE?
I'm dismayed at no longer being able to benefit from the added value of the Intel compiler (which I bought for this very purpose), especially considering the high degree of vectorization in the application.
Any help and/or suggestion would be greatly appreciated,
Daniel