Hello,
I am seeing surprising OpenMP task performance with Intel C++ 19.0 Update 5 that I do not see with GCC 9.2. In the demo app below, I expect the in-loop taskwait (or the alternative taskgroup) to keep only one thread busy at a time, so the program should run at about the same speed as the serial version. GCC behaves this way, but Intel C++ shows 100% CPU load and a 1.8x slowdown. More importantly, our real application shows the same slowdown instead of a speedup when it uses a set of tasks within a taskgroup or followed by a taskwait.
// Demo for Intel C++ 19.0 Update 5 OpenMP performance issues
// Serial speed of Intel C++ is ~3.3x slower than GCC 9.2.0
// With only taskwait after while loop on quad-core Haswell CPU:
//   Intel C++: 2.8x speedup
//   GCC: 3.5x speedup
//   Both use 100% CPU as expected
// With taskwait in while loop:
//   Intel C++: 100% CPU usage and give 1.8x slowdown
//   GCC: 1 CPU/thread used and no slowdown as expected
// This taskwait is not needed here but the same issue is seen in real application with multiple tasks followed by a taskwait
// Same behavior seen with a taskgroup around the one task instead of this taskwait
// icl /Qstd=c++11 /DNOMINMAX /DWIN32_LEAN_AND_MEAN /DNDEBUG /Qopenmp /O3

#include <atomic>
#include <cstddef>
#include <iostream>
#include <omp.h>

int main()
{
    #pragma omp parallel
    {
        #pragma omp single
        {
            bool run( true );
            std::size_t i( 0 );
            std::atomic_size_t sum( 0u );
            double const wall_time_beg( omp_get_wtime() );
            while ( run ) {
                // #pragma omp taskgroup // Same behavior as the in-loop taskwait
                {
                    #pragma omp task shared(sum)
                    {
                        std::size_t loc( 0u );
                        for ( std::size_t k = 0u; k < 2000000000u; ++k ) loc += k/2;
                        sum += loc;
                    } // omp task
                } // omp taskgroup
                #pragma omp taskwait // GCC gives expected 1 CPU/thread usage: Intel C++ gives 100% CPU and 1.8x slowdown!
                if ( ++i > 50u ) run = false;
            }
            #pragma omp taskwait
            std::cout << "sum = " << sum << ' ' << omp_get_wtime() - wall_time_beg << ' ' << i << std::endl;
        } // omp single
    } // omp parallel
}
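For comparison, here is a minimal sketch of the multi-task pattern mentioned above: several tasks created inside a taskgroup, repeated over a number of batches. The function names (work, run_batches), the batch and task counts, and the work loop are hypothetical placeholders chosen for illustration, not taken from the real application.

```cpp
#include <atomic>
#include <cstddef>

// Hypothetical stand-in for one unit of real work: sums k/2 for k in [0, n).
std::size_t work( std::size_t n )
{
    std::size_t loc( 0u );
    for ( std::size_t k = 0u; k < n; ++k ) loc += k/2;
    return loc;
}

// Runs n_batches batches of n_tasks tasks each. Each batch sits inside a
// taskgroup, so the generating thread waits for the whole batch to finish
// before it creates the next one. If OpenMP is not enabled, the pragmas are
// ignored and the same result is computed serially.
std::size_t run_batches( std::size_t n_batches, std::size_t n_tasks, std::size_t n_iter )
{
    std::atomic_size_t sum( 0u );
    #pragma omp parallel
    {
        #pragma omp single
        {
            for ( std::size_t b = 0u; b < n_batches; ++b ) {
                #pragma omp taskgroup
                {
                    for ( std::size_t t = 0u; t < n_tasks; ++t ) {
                        #pragma omp task shared(sum)
                        {
                            sum += work( n_iter );
                        } // omp task
                    }
                } // omp taskgroup: all tasks of this batch are complete here
            }
        } // omp single
    } // omp parallel
    return sum.load();
}
```

Calling run_batches( 50u, 4u, 2000000000u ) from main and timing it with omp_get_wtime(), as in the demo, would be one way to compare the two compilers on this pattern; with several tasks per batch, a speedup over the serial run would be the expected outcome.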
Can anyone shed light on this? Looks like a buggy task implementation but maybe there is more to it.
Thanks,
Stuart