I am trying to reproduce the results of a legacy routine that does sparse matrix-vector multiplication with, well, legacy code like this:
void foo (const double * const a, const double * const x, double * const b, const int * const ja, int *const ia, const int n) {
  int i;
  const int bl = 2;

  for (i = 0; i < n; i++) {
    int l = (ia[i+1] - ia[i]);
    int c = ia[i] - 1;
    double s = 0.0;
    while ( l ) {
      s += a[ c     ] * x[ ja[c    ] - 1 ];
      s += a[ c + 1 ] * x[ ja[c + 1] - 1 ];
      l -= bl;
      c += bl;
    }
    b[i] = s;
  }
}
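For context, this is a CSR (compressed sparse row) matrix-vector product with 1-based ia/ja indices and the inner loop hand-unrolled by a factor of 2, so every row is assumed to store an even number of entries. A minimal driver of the kind I use for testing looks roughly like this; the 2x2 matrix is made up for illustration and is not from the legacy data:

#include <stdio.h>

/* The legacy routine shown above. */
void foo (const double * const a, const double * const x, double * const b,
          const int * const ja, int *const ia, const int n);

int main(void) {
  /* A made-up 2x2 matrix [[1 2],[3 4]] in 1-based CSR storage.
     Every row stores an even number of entries, because the inner
     loop of foo is unrolled by 2 and only stops when l reaches 0. */
  const double a[]  = { 1.0, 2.0, 3.0, 4.0 };   /* nonzero values         */
  const int    ja[] = { 1, 2, 1, 2 };           /* 1-based column indices  */
  int          ia[] = { 1, 3, 5 };              /* 1-based row pointers    */
  const double x[]  = { 1.0, 1.0 };
  double       b[2];

  foo(a, x, b, ja, ia, 2);
  printf("b = [ %g, %g ]\n", b[0], b[1]);       /* expect [ 3, 7 ] */
  return 0;
}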
When I compile this code with Intel C compiler 16.0.1.150 using "icc -qopenmp -qopt-report=5 foo1.c", the inner loop on line 9 gets vectorized:
Intel(R) Advisor can now assist with vectorization and show optimization
report messages with your source code.
See "https://software.intel.com/en-us/intel-advisor-xe" for details.

    Report from: Interprocedural optimizations [ipo]

 WHOLE PROGRAM (SAFE) [EITHER METHOD]: false
 WHOLE PROGRAM (SEEN) [TABLE METHOD]: false
 WHOLE PROGRAM (READ) [OBJECT READER METHOD]: false

INLINING OPTION VALUES:
  -inline-factor: 100
  -inline-min-size: 30
  -inline-max-size: 230
  -inline-max-total-size: 2000
  -inline-max-per-routine: 10000
  -inline-max-per-compile: 500000

In the inlining report below:
   "sz" refers to the "size" of the routine. The smaller a routine's size,
      the more likely it is to be inlined.
   "isz" refers to the "inlined size" of the routine. This is the amount
      the calling routine will grow if the called routine is inlined into it.

Begin optimization report for: foo(const double *const, const double *const, double *const, const int *const, int *const, const int)

    Report from: Interprocedural optimizations [ipo]

INLINE REPORT: (foo(const double *const, const double *const, double *const, const int *const, int *const, const int)) [1/1=100.0%] src/foo1.c(1,127)

    Report from: Loop nest, Vector & Auto-parallelization optimizations [loop, vec, par]

LOOP BEGIN at src/foo1.c(5,3)
   remark #15542: loop was not vectorized: inner loop was already vectorized

   LOOP BEGIN at src/foo1.c(9,5)
      remark #15389: vectorization support: reference a has unaligned access   [ src/foo1.c(11,7) ]
      remark #15381: vectorization support: unaligned access used inside loop body
      remark #15305: vectorization support: vector length 2
      remark #15399: vectorization support: unroll factor set to 4
      remark #15309: vectorization support: normalized vectorization overhead 0.149
      remark #15300: LOOP WAS VECTORIZED
      remark #15450: unmasked unaligned unit stride loads: 1
      remark #15458: masked indexed (or gather) loads: 2
      remark #15460: masked strided loads: 4
      remark #15475: --- begin vector loop cost summary ---
      remark #15476: scalar loop cost: 29
      remark #15477: vector loop cost: 18.500
      remark #15478: estimated potential speedup: 1.540
      remark #15488: --- end vector loop cost summary ---
   LOOP END

   LOOP BEGIN at src/foo1.c(9,5)
   <Remainder loop for vectorization>
      remark #15389: vectorization support: reference a has unaligned access   [ src/foo1.c(11,7) ]
      remark #15381: vectorization support: unaligned access used inside loop body
      remark #15305: vectorization support: vector length 2
      remark #15309: vectorization support: normalized vectorization overhead 0.780
      remark #15301: REMAINDER LOOP WAS VECTORIZED
   LOOP END

   LOOP BEGIN at src/foo1.c(9,5)
   <Remainder loop for vectorization>
   LOOP END
LOOP END
===========================================================================
After that, I insert "#pragma omp parallel for" before the outer loop:
void foo (const double * const a, const double * const x, double * const b, const int * const ja, int *const ia, const int n) {
  int i;
  const int bl = 2;
#pragma omp parallel for
  for (i = 0; i < n; i++) {
    int l = (ia[i+1] - ia[i]);
    int c = ia[i] - 1;
    double s = 0.0;
    while ( l ) {
      s += a[ c     ] * x[ ja[c    ] - 1 ];
      s += a[ c + 1 ] * x[ ja[c + 1] - 1 ];
      l -= bl;
      c += bl;
    }
    b[i] = s;
  }
}
With this change, the loop on line 9 is no longer vectorized:
Intel(R) Advisor can now assist with vectorization and show optimization
report messages with your source code.
See "https://software.intel.com/en-us/intel-advisor-xe" for details.

    Report from: Interprocedural optimizations [ipo]

 WHOLE PROGRAM (SAFE) [EITHER METHOD]: false
 WHOLE PROGRAM (SEEN) [TABLE METHOD]: false
 WHOLE PROGRAM (READ) [OBJECT READER METHOD]: false

INLINING OPTION VALUES:
  -inline-factor: 100
  -inline-min-size: 30
  -inline-max-size: 230
  -inline-max-total-size: 2000
  -inline-max-per-routine: 10000
  -inline-max-per-compile: 500000

In the inlining report below:
   "sz" refers to the "size" of the routine. The smaller a routine's size,
      the more likely it is to be inlined.
   "isz" refers to the "inlined size" of the routine. This is the amount
      the calling routine will grow if the called routine is inlined into it.

Begin optimization report for: foo(const double *const, const double *const, double *const, const int *const, int *const, const int)

    Report from: Interprocedural optimizations [ipo]

INLINE REPORT: (foo(const double *const, const double *const, double *const, const int *const, int *const, const int)) [1/1=100.0%] src/foo2.c(1,127)

    Report from: OpenMP optimizations [openmp]

src/foo2.c(4:1-4:1):OMP:foo: OpenMP DEFINED LOOP WAS PARALLELIZED

    Report from: Loop nest, Vector & Auto-parallelization optimizations [loop, vec, par]

LOOP BEGIN at src/foo2.c(5,3)
   remark #15541: outer loop was not auto-vectorized: consider using SIMD directive

   LOOP BEGIN at src/foo2.c(9,5)
      remark #15523: loop was not vectorized: loop control variable c was found, but loop iteration count cannot be computed before executing the loop
   LOOP END
LOOP END
===========================================================================
My question is: can someone from Intel help me understand WHY the inner loop (line 9) is vectorized in my foo1, but not in foo2?
I have a humble request: please answer the "why" question rather than divert the discussion into how to write more vectorizable code than the snippets above. My goal is not to optimize performance, but rather to write a routine that reproduces bit-for-bit the results of a legacy routine in which this loop is NOT vectorized. For that purpose, I would like the loop on line 9 NOT to be vectorized in my new code, because otherwise the vector reduction changes the order of operations, which slightly changes the numerical results (there is a small illustration of this reassociation effect at the end of this post). From the tests shown above, it looks like I can disable vectorization in one of two ways:
- "Parallelize" the i-loop, possibly with 1 thread, so that the compiler gives up on vectorization of the inner loop, but I don't know if this workaround will survive until the next compiler version.
- Put "pragma novector" before the j-loop, which does prevent vectorization, but does not recover bit-to-bit the results of the original code (possibly because the original code has some automatic unrolling).
Understanding why the compiler vectorizes the code in foo1 but not in foo2 will help me set up a strategy for writing code that produces results bitwise identical to the legacy code.
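To make concrete why I care about the order of operations: reassociating a floating-point sum, which is effectively what a vectorized reduction does, can change the last bits of the result. A tiny standalone illustration with contrived values (not from my actual data):

#include <stdio.h>

int main(void) {
  /* Contrived values, chosen only to make the effect visible. */
  double p = 1.0e16, q = -1.0e16, r = 1.0;

  double scalar_order  = (p + q) + r;   /* left-to-right, as in the scalar loop */
  double reassoc_order = p + (q + r);   /* a different association of the sum   */

  printf("%.17g\n", scalar_order);      /* prints 1 on an IEEE 754 machine */
  printf("%.17g\n", reassoc_order);     /* prints 0: r is absorbed into q  */
  return 0;
}

On an IEEE 754 machine this prints 1 and then 0, which is why even a simple sum like the one in the legacy loop cannot be reassociated if I want identical bits.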