Hi all,
I would like to understand the behavior of this small piece of code, which I extracted from a bigger application that makes use of vectorization and SIMD instructions.
Please don't look at the design: it is inherited from my original code and I want to keep it as it is to reproduce the anomaly, even though I agree it makes no sense in this small context. For the alignment I'm following the guidelines described here.
I have the following Dummy class.
dummy.h
#ifdef __INTEL_COMPILER
typedef double * __restrict__ Real_ptr __attribute__((align_value(32)));
typedef const double * const __restrict__ ConstReal_ptr __attribute__((align_value(32)));
#else
typedef double * __restrict__ Real_ptr __attribute__((aligned(32)));
typedef const double * const __restrict__ ConstReal_ptr __attribute__((aligned(32)));
#endif

class Dummy
{
public:
    virtual void calculate( const unsigned int n,
                            ConstReal_ptr x,
                            ConstReal_ptr y,
                            Real_ptr z ) const;

private:
    double computeSingleValue( const double x, const double y ) const;
};
dummy.cpp
#include "dummy.h" #include <algorithm> static const double K = 10.0; void Dummy::calculate( const unsigned int n, ConstReal_ptr x, ConstReal_ptr y, Real_ptr z ) const { for( unsigned int i = 0; i < n; ++i) { z[i] = computeSingleValue( x[i], y[i] ); } } double Dummy::computeSingleValue( const double x, const double y ) const { return std::max(K, (x >= y) ? x : y); }
The main function tests the calculate method and prints a message whenever the output differs from the expected value. main.cpp is the following:
#include "dummy.h" #include <cassert> #include <cmath> #include <iostream> #include <stdlib.h> int main() { const unsigned int N = 4; Real_ptr x; assert( 0 == posix_memalign ( (void **)&x, 32, sizeof ( double ) * N ) ); x[0] = 0.0; x[1] = 10.0; x[2] = 100.0; x[3] = 1000.0; Real_ptr y; assert( 0 == posix_memalign ( (void **)&y, 32, sizeof ( double ) * N ) ); y[0] = 0.0; y[1] = 10.0; y[2] = 100.0; y[3] = 1000.0; Real_ptr z; assert( 0 == posix_memalign ( (void **)&z, 32, sizeof ( double ) * N ) ); z[0] = 0.0; z[1] = 0.0; z[2] = 0.0; z[3] = 0.0; Dummy obj; obj.calculate( N, x, y, z ); if( std::abs(10.0 - z[0])> 1.0E-18 ) { std::cout << "FAIL 0: z = "<< z[0] << std::endl; }; if( std::abs(10.0 - z[1])> 1.0E-18 ) { std::cout << "FAIL 1: z = "<< z[1] << std::endl; }; if( std::abs(100.0 - z[2])> 1.0E-18 ) { std::cout << "FAIL 2: z = "<< z[2] << std::endl; }; if( std::abs(1000.0 - z[3])> 1.0E-18 ) { std::cout << "FAIL 3: z = "<< z[3] << std::endl; }; free(x); free(y); free(z); }
Now, I'm compiling it with -O2 and the following compilers (rough build commands are shown after the list):
- g++ (GCC) 4.8.5 20150623 (Red Hat 4.8.5-4)
- icpc (ICC) 16.0.3 20160415
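The build is nothing special; it looks roughly like this (the exact invocations on my machine may differ slightly, I'm only adding -O2 on top of the defaults):

g++  -O2 -c dummy.cpp main.cpp && g++  -O2 dummy.o main.o -o test_gcc
icpc -O2 -c dummy.cpp main.cpp && icpc -O2 dummy.o main.o -o test_icc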
With GCC everything works fine and the result is as expected. With the Intel compiler, however, the last two elements of the z array are wrong and the output of the program is
FAIL 2: z = 10
FAIL 3: z = 10
What puzzles me, apart from the compiler dependency, is that I get the correct output if I do any one of the following:
- decrease the optimization to -O1 or -O0
- move all the source code into a single translation unit
- replace z[i] = computeSingleValue( x[i], y[i] ); with z[i] = std::max( K, (x[i] >= y[i]) ? x[i] : y[i] ); directly in the loop in dummy.cpp
- add a std::cout << std::endl; in the body of computeSingleValue in dummy.cpp
- remove the __restrict__ keyword from the ConstReal_ptr typedef (see the sketch after this list)
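For instance, the last workaround means the ConstReal_ptr typedef in dummy.h becomes the following (shown for both branches; only __restrict__ is removed, the alignment attribute and Real_ptr stay as they are):

#ifdef __INTEL_COMPILER
typedef const double * const ConstReal_ptr __attribute__((align_value(32)));
#else
typedef const double * const ConstReal_ptr __attribute__((aligned(32)));
#endif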
I'm probably doing something wrong, but I don't get it. Any help would be really appreciated.
Thanks in advance and regards,
Massi