I'm trying to build an optimized, parallel version of [OpenCV SURF][1], and in particular of [surf.cpp][2], using the Intel C++ compiler.
I'm using Intel Advisor to locate inefficient and unvectorized loops. It suggests rebuilding the code with the `icpc` compiler (instead of `gcc`) and then using the `-xCORE-AVX2` flag, since my hardware supports it.
My original `cmake` command for building OpenCV with `g++` was:

    cmake -D CMAKE_BUILD_TYPE=RelWithDebInfo -D CMAKE_INSTALL_PREFIX=... -D OPENCV_EXTRA_MODULES_PATH=... -DWITH_TBB=OFF -DWITH_OPENMP=ON

I then built the application that uses SURF with `g++ ... -O3 -g -fopenmp`.
With `icpc`, the `cmake` command becomes:

    cmake -D CMAKE_BUILD_TYPE=RelWithDebInfo -D CMAKE_INSTALL_PREFIX=... -D OPENCV_EXTRA_MODULES_PATH=... -DWITH_TBB=OFF -DWITH_OPENMP=ON -DCMAKE_C_COMPILER=icc -DCMAKE_CXX_COMPILER=icpc -DCMAKE_CXX_FLAGS="-debug inline-debug-info -parallel-source-info=2 -ipo -parallel -xCORE-AVX2 -Bdynamic"

(Note in particular `-DCMAKE_C_COMPILER`, `-DCMAKE_CXX_COMPILER` and `-DCMAKE_CXX_FLAGS`.)
I compiled the SURF application with `-g -O3 -ipo -parallel -qopenmp -xCORE-AVX2` and linked it with `-shared-intel -parallel`.
I expected the `icpc` build to be faster than the `g++` one, but it isn't: the `icpc` build takes 0.15 s while the `g++` build takes 0.12 s (I ran the experiment several times and the numbers are consistent).
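For anyone who wants to reproduce the comparison, here is a minimal sketch of a timing harness that could be compiled unchanged with both `g++` and `icpc`. It assumes the SURF call can be wrapped in a callable; `median_seconds` and `placeholder_workload` are hypothetical names used for illustration only and are not part of OpenCV or of my application.

```cpp
#include <algorithm>
#include <chrono>
#include <vector>

// Hypothetical helper: runs `work` `runs` times (runs >= 1) and returns the
// median wall-clock time in seconds. The median is less sensitive to a
// single slow outlier than the mean.
template <typename F>
double median_seconds(F&& work, int runs) {
    std::vector<double> samples;
    samples.reserve(static_cast<std::size_t>(runs));
    for (int i = 0; i < runs; ++i) {
        const auto start = std::chrono::steady_clock::now();
        work();
        const auto stop = std::chrono::steady_clock::now();
        samples.push_back(std::chrono::duration<double>(stop - start).count());
    }
    std::sort(samples.begin(), samples.end());
    return samples[samples.size() / 2];
}

// Hypothetical stand-in for the real workload (the SURF call in the
// application). The volatile sink keeps the compiler from removing the loop.
volatile double sink = 0.0;

void placeholder_workload() {
    double acc = 0.0;
    for (int i = 0; i < 1000000; ++i) {
        acc += i * 0.5;
    }
    sink = acc;
}
```

Calling `median_seconds(placeholder_workload, 11)` from `main` in each build, with the placeholder replaced by the actual SURF call, would give one comparable number per compiler.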
Why does this happen? Am I doing something wrong with `icpc`?
[1]: http://docs.opencv.org/3.0-beta/doc/py_tutorials/py_feature2d/py_surf_in...
[2]: https://github.com/opencv/opencv_contrib/blob/master/modules/xfeatures2d...