
Loop gets vectorized in scalar code, but no vectorization inside parallel loop


I am trying to reproduce the results of a legacy routine that does sparse matrix-vector multiplication with, well, legacy code like this:

void foo (const double * const a, const double * const x, double * const b, const int * const ja, int *const ia, const int n) {
  int i;
  const int bl = 2;                  /* inner loop consumes bl = 2 entries per iteration */

  for (i = 0; i < n; i++) {
    int l = (ia[i+1] - ia[i]);       /* number of stored entries in row i (ia is 1-based) */
    int c = ia[i] - 1;               /* 0-based offset of the first entry of row i */
    double s = 0.0;
    while ( l ) {                    /* terminates only if each row length is a multiple of bl */
      s += a[ c     ] * x[ ja[c    ] - 1 ];
      s += a[ c + 1 ] * x[ ja[c + 1] - 1 ];
      l -= bl;
      c += bl;
    }
    b[i] = s;
  }
}
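
To make the storage format concrete (this is just my reading of the code above, not anything from the legacy documentation): ia holds 1-based row pointers, ja holds 1-based column indices, and every row stores a number of entries that is a multiple of bl = 2, with shorter rows padded by explicit zero entries. A tiny driver under those assumptions:

#include <stdio.h>

void foo (const double * const a, const double * const x, double * const b, const int * const ja, int *const ia, const int n);

int main (void) {
  /* 3x3 example in 1-based CSR-like storage:
       [ 10  0  2 ]
       [  0  5  0 ]
       [  1  0  3 ]
     The middle row has a single nonzero, so it is padded with an
     explicit 0 entry to keep every row length a multiple of bl = 2. */
  double a [] = { 10.0, 2.0, 5.0, 0.0, 1.0, 3.0 };
  int    ja[] = { 1, 3, 2, 2, 1, 3 };
  int    ia[] = { 1, 3, 5, 7 };
  double x [] = { 1.0, 2.0, 3.0 };
  double b [3];

  foo (a, x, b, ja, ia, 3);
  printf ("%g %g %g\n", b[0], b[1], b[2]);   /* expected: 16 10 10 */
  return 0;
}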

When I compile the routine above (foo1.c) with the Intel C compiler 16.0.1.150 using "icc -qopenmp -qopt-report=5 foo1.c", the loop at line 9 gets vectorized:

Intel(R) Advisor can now assist with vectorization and show optimization
  report messages with your source code.
See "https://software.intel.com/en-us/intel-advisor-xe" for details.


    Report from: Interprocedural optimizations [ipo]

  WHOLE PROGRAM (SAFE) [EITHER METHOD]: false
  WHOLE PROGRAM (SEEN) [TABLE METHOD]: false
  WHOLE PROGRAM (READ) [OBJECT READER METHOD]: false

INLINING OPTION VALUES:
  -inline-factor: 100
  -inline-min-size: 30
  -inline-max-size: 230
  -inline-max-total-size: 2000
  -inline-max-per-routine: 10000
  -inline-max-per-compile: 500000

In the inlining report below:
   "sz" refers to the "size" of the routine. The smaller a routine's size,
      the more likely it is to be inlined.
   "isz" refers to the "inlined size" of the routine. This is the amount
      the calling routine will grow if the called routine is inlined into it.
      The compiler generally limits the amount a routine can grow by having
      routines inlined into it.

Begin optimization report for: foo(const double *const, const double *const, double *const, const int *const, int *const, const int)

    Report from: Interprocedural optimizations [ipo]

INLINE REPORT: (foo(const double *const, const double *const, double *const, const int *const, int *const, const int)) [1/1=100.0%] src/foo1.c(1,127)


    Report from: Loop nest, Vector & Auto-parallelization optimizations [loop, vec, par]


LOOP BEGIN at src/foo1.c(5,3)
   remark #15542: loop was not vectorized: inner loop was already vectorized

   LOOP BEGIN at src/foo1.c(9,5)
      remark #15389: vectorization support: reference a has unaligned access   [ src/foo1.c(11,7) ]
      remark #15381: vectorization support: unaligned access used inside loop body
      remark #15305: vectorization support: vector length 2
      remark #15399: vectorization support: unroll factor set to 4
      remark #15309: vectorization support: normalized vectorization overhead 0.149
      remark #15300: LOOP WAS VECTORIZED
      remark #15450: unmasked unaligned unit stride loads: 1
      remark #15458: masked indexed (or gather) loads: 2
      remark #15460: masked strided loads: 4
      remark #15475: --- begin vector loop cost summary ---
      remark #15476: scalar loop cost: 29
      remark #15477: vector loop cost: 18.500
      remark #15478: estimated potential speedup: 1.540
      remark #15488: --- end vector loop cost summary ---
   LOOP END

   LOOP BEGIN at src/foo1.c(9,5)
   <Remainder loop for vectorization>
      remark #15389: vectorization support: reference a has unaligned access   [ src/foo1.c(11,7) ]
      remark #15381: vectorization support: unaligned access used inside loop body
      remark #15305: vectorization support: vector length 2
      remark #15309: vectorization support: normalized vectorization overhead 0.780
      remark #15301: REMAINDER LOOP WAS VECTORIZED
   LOOP END

   LOOP BEGIN at src/foo1.c(9,5)
   <Remainder loop for vectorization>
   LOOP END
LOOP END
===========================================================================

After that, I insert "#pragma omp parallel for" before the outer loop:

void foo (const double * const a, const double * const x, double * const b, const int * const ja, int *const ia, const int n) {
  int i;
  const int bl = 2;
#pragma omp parallel for
  for (i = 0; i < n; i++) {
    int l = (ia[i+1] - ia[i]);
    int c = ia[i] - 1;
    double s = 0.0;
    while ( l ) {
      s += a[ c     ] * x[ ja[c    ] - 1 ];
      s += a[ c + 1 ] * x[ ja[c + 1] - 1 ];
      l -= bl;
      c += bl;
    }
    b[i] = s;
  }
}

With that change, the loop at line 9 is no longer vectorized:

Intel(R) Advisor can now assist with vectorization and show optimization
  report messages with your source code.
See "https://software.intel.com/en-us/intel-advisor-xe" for details.


    Report from: Interprocedural optimizations [ipo]

  WHOLE PROGRAM (SAFE) [EITHER METHOD]: false
  WHOLE PROGRAM (SEEN) [TABLE METHOD]: false
  WHOLE PROGRAM (READ) [OBJECT READER METHOD]: false

INLINING OPTION VALUES:
  -inline-factor: 100
  -inline-min-size: 30
  -inline-max-size: 230
  -inline-max-total-size: 2000
  -inline-max-per-routine: 10000
  -inline-max-per-compile: 500000

In the inlining report below:
   "sz" refers to the "size" of the routine. The smaller a routine's size,
      the more likely it is to be inlined.
   "isz" refers to the "inlined size" of the routine. This is the amount
      the calling routine will grow if the called routine is inlined into it.
      The compiler generally limits the amount a routine can grow by having
      routines inlined into it.

Begin optimization report for: foo(const double *const, const double *const, double *const, const int *const, int *const, const int)

    Report from: Interprocedural optimizations [ipo]

INLINE REPORT: (foo(const double *const, const double *const, double *const, const int *const, int *const, const int)) [1/1=100.0%] src/foo2.c(1,127)


    Report from: OpenMP optimizations [openmp]

src/foo2.c(4:1-4:1):OMP:foo:  OpenMP DEFINED LOOP WAS PARALLELIZED

    Report from: Loop nest, Vector & Auto-parallelization optimizations [loop, vec, par]


LOOP BEGIN at src/foo2.c(5,3)
   remark #15541: outer loop was not auto-vectorized: consider using SIMD directive

   LOOP BEGIN at src/foo2.c(9,5)
      remark #15523: loop was not vectorized: loop control variable c was found, but loop iteration count cannot be computed before executing the loop
   LOOP END
LOOP END
===========================================================================

 

My question is: can someone from Intel help me understand WHY the inner loop (line 9) is vectorized in my foo1, but not in foo2?

I have a humble request: please answer my question "why" rather than divert the discussion into how to write better vectorizable code than the snippets above. My goal is not to optimize performance, but to write a routine that reproduces, bit for bit, the results of a legacy routine in which this loop is NOT vectorized. For that purpose, I would like the loop at line 9 NOT to be vectorized in my new code, because otherwise the vector reduction changes the order of operations, which slightly changes the numerical results (a small standalone demonstration of this effect is at the end of the post). From the tests shown above, it looks like I can disable vectorization in one of two ways:

  1. "Parallelize" the i-loop, possibly with 1 thread, so that the compiler gives up on vectorization of the inner loop, but I don't know if this workaround will survive until the next compiler version.
  2. Put "pragma novector" before the j-loop, which does prevent vectorization, but does not recover bit-to-bit the results of the original code (possibly because the original code has some automatic unrolling).

Understanding why the compiler vectorizes the code in foo1 but not in foo2 will help me set up a strategy for writing code that produces bitwise-identical results to the legacy code.
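
To make the reordering point concrete, here is a tiny standalone demonstration (generic values, nothing taken from the legacy code): summing the same four terms serially versus as two partial sums, the way a 2-lane vector reduction accumulates them, already gives different doubles under IEEE-754 round-to-nearest.

#include <stdio.h>

int main (void) {
  /* values chosen so that rounding differs between summation orders */
  double t[4] = { 1.0e16, 1.0, -1.0e16, 1.0 };

  /* serial order, as in the scalar while loop */
  double serial = ((t[0] + t[1]) + t[2]) + t[3];

  /* two partial sums combined at the end, as a 2-lane vector reduction
     would do: lane 0 sees t[0] and t[2], lane 1 sees t[1] and t[3]     */
  double lane0 = t[0] + t[2];
  double lane1 = t[1] + t[3];
  double vectorlike = lane0 + lane1;

  printf ("serial      = %.17g\n", serial);      /* prints 1 */
  printf ("vector-like = %.17g\n", vectorlike);  /* prints 2 */
  return 0;
}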

