Hi all,
We've noticed some strange behaviour with the Intel C++ compiler that we cannot explain. Our project is currently compiled with the Intel Compiler 12.1.0258, and we are looking at making performance improvements. One area we have identified is the memory-intensive parts of the code, where memset and memcpy can become performance bottlenecks. We develop for Mac, Linux and Windows, and are aware that, due to the Mac's system architecture, memcpy is typically slower on that platform than on Linux. I have noticed some improvements on Mac OS with Intel 17, but some very odd behaviour on Linux.
Below is the C++ example I have been using to test the timing:
#include <iostream>
#include <sys/time.h>
#include <cstdlib>
#include <cstring>

#define N_BYTES 1073741824

typedef long long ll;

long long current_timestamp() {
    struct timeval tv;
    gettimeofday(&tv, NULL); // get current time
    ll milliseconds = tv.tv_sec * 1000LL + tv.tv_usec / 1000;
    return milliseconds;
}

int main(int argc, char** argv) {
    // Alloc some memory
    char* mem_location = (char*) malloc(N_BYTES);
    // 'Warm' the memory
    memset(mem_location, 1, N_BYTES);
    // Now time how long it takes to set it to something else
    ll memset_begin = current_timestamp();
    memset(mem_location, 0, N_BYTES);
    ll memset_time = current_timestamp() - memset_begin;
    std::cout << "Memset of " << N_BYTES << " took " << memset_time << "ms\n";
    return 0;
}
The output is as follows:
[after setting compilervars.sh]
$ icpc -v
icpc version 17.0.1 (gcc version 4.4.7 compatibility)
$ icpc MemsetTime.cpp -o MemsetTime-icpc17
$ ./MemsetTime-icpc17
Memset of 1073741824 took 143ms
$ g++ -v
[...]
gcc version 4.4.7 20120313 (Red Hat 4.4.7-17) (GCC)
$ g++ MemsetTime.cpp -o MemsetTime-gpp4.4.7
$ ./MemsetTime-gpp4.4.7
Memset of 1073741824 took 50ms
...and in a fresh shell, with Intel 12 (again sourcing the relevant compilervars):
$ icpc -v
icpc version 12.1.0 (gcc version 4.4.7 compatibility)
$ icpc MemsetTime.cpp -o MemsetTime-icpc12
$ ./MemsetTime-icpc12
Memset of 1073741824 took 50ms
So it would appear that on Linux, the Intel 17 compiler uses an implementation of memset that takes nearly 3 times as long to run as the one used by gcc and Intel 12 (143ms versus 50ms)! After looking in VTune, I see that Intel 12 is using '_intel_fast_memset' and Intel 17 is using '_intel_avx_rep_memset'. Something even stranger is that when I compile with icpc 17 and -g, the timing goes down to 50ms and the binary uses _GI_memset.
Has anyone else experienced this? Is there something in my timing logic that does not accurately represent the time taken for memset? These examples are all on Linux; on Mac OS the timings are consistently around 50ms.
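In case the timing logic is the problem, here is a variant of the measurement I could use to rule out two possible sources of error: a single run with millisecond resolution, and the compiler dropping a memset whose result is never read. This is only a rough sketch, not a proper benchmark. The helper name best_memset_us and the repetition count are my own assumptions; it still relies on gettimeofday, which is not a monotonic clock.

```cpp
#include <sys/time.h>
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <cstring>

// Wall-clock time in microseconds (same gettimeofday source as the original test).
long long now_us() {
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec * 1000000LL + tv.tv_usec;
}

// Hypothetical helper: runs memset `reps` times over the buffer and returns
// the fastest run in microseconds, or -1 if a read-back check fails.
long long best_memset_us(char* buf, std::size_t n_bytes, int reps) {
    assert(buf != NULL && n_bytes > 0 && reps > 0);
    long long best = -1;
    for (int i = 0; i < reps; ++i) {
        int value = i % 2;  // alternate 0/1 so each pass writes different data
        long long begin = now_us();
        memset(buf, value, n_bytes);
        long long elapsed = now_us() - begin;
        // Read one byte back through a volatile pointer so the store is
        // observably used and harder for the compiler to discard.
        volatile char* probe = buf;
        if (probe[n_bytes - 1] != (char) value) return -1;
        if (best < 0 || elapsed < best) best = elapsed;
    }
    return best;
}
```

In the test above this would replace the single timed memset, for example best_memset_us(mem_location, N_BYTES, 5) after the warm-up call, with the result printed in microseconds. Taking the fastest of several runs should reduce the influence of page faults and other one-off effects on the reported figure.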
Thanks!
Alastair