Hello All,
Hope I am asking in the right forum!!
I have a simple/naive question. I wrote a simple program that runs on one thread of a KNL (68 cores, Flat-Quadrant mode, MCDRAM used). I ran my code twice with the following configurations:
1) #pragma simd reduction(...) at the top of the loop and compiler option -xMIC_AVX512.
2) #pragma novector and removed -xMIC_AVX512 and added -no-simd. The loop is not vectorized and no AVX instructions are used (checked the assembly file).
The first configuration reaches 1.5 GFLOPS and the second 0.8 GFLOPS, so the speedup is only about 2x. Can anyone please explain why I don't get a better speedup (closer to 8x)?
long count = 10000000;
// Same loop for cold start
stime = dsecnd();
// 1) #pragma simd reduction(+:result)
// 2) #pragma novector
for (long i = 0; i < count; i++) {
    result += (A[i] * B[i]);
}
etime = dsecnd();
double bestExTime = (etime - stime);
double gplops = (1.e-9 * 2.0 * count) / bestExTime;
printf("%f,%f\n", result, gplops);
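For reference, here is a rough sketch of the arithmetic behind the figures above. It assumes A and B hold doubles (8 bytes each; the element type is not shown in the snippet), so each iteration reads 16 bytes and performs 2 flops. The function names are purely illustrative and not part of my program.

```c
#include <stdio.h>

/* Ratio of two measured throughputs (vectorized run / scalar run). */
double speedup(double vec_gflops, double scalar_gflops) {
    return vec_gflops / scalar_gflops;
}

/* Read traffic (GB/s) implied by a measured GFLOPS rate for the loop above.
   flops_per_iter: 2 for one multiply plus one add.
   bytes_per_iter: ASSUMPTION, 16 if A[i] and B[i] are both 8-byte doubles. */
double implied_read_gbs(double gflops, double bytes_per_iter, double flops_per_iter) {
    double giga_iters_per_sec = gflops / flops_per_iter;
    return giga_iters_per_sec * bytes_per_iter;
}
```

With my measurements, speedup(1.5, 0.8) gives 1.875, and implied_read_gbs(1.5, 16.0, 2.0) gives 12 GB/s of reads for the vectorized run, versus 6.4 GB/s for the scalar run, if the double assumption holds.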
Thanks,