I am trying to reproduce the results of a legacy routine that does sparse matrix-vector multiplication with, well, legacy code like this:
void foo (const double * const a, const double * const x, double * const b, const int * const ja, int *const ia, const int n) {
  int i;
  const int bl = 2;

  for (i = 0; i < n; i++) {
    int l = (ia[i+1] - ia[i]);
    int c = ia[i] - 1;
    double s = 0.0;
    while ( l ) {
      s += a[ c     ] * x[ ja[c    ] - 1 ];
      s += a[ c + 1 ] * x[ ja[c + 1] - 1 ];
      l -= bl;
      c += bl;
    }
    b[i] = s;
  }
}
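For context, this is a CSR (compressed sparse row) matrix-vector product with 1-based ia/ja indices and the inner loop hand-unrolled by a factor of 2, so every row is assumed to store an even number of entries. A minimal driver of the kind I use for testing looks roughly like this; the 2x2 matrix is made up for illustration and is not from the legacy data:

#include <stdio.h>

/* The legacy routine shown above. */
void foo (const double * const a, const double * const x, double * const b,
          const int * const ja, int *const ia, const int n);

int main(void) {
  /* A made-up 2x2 matrix [[1 2],[3 4]] in 1-based CSR storage.
     Every row stores an even number of entries, because the inner
     loop of foo is unrolled by 2 and only stops when l reaches 0. */
  const double a[]  = { 1.0, 2.0, 3.0, 4.0 };   /* nonzero values         */
  const int    ja[] = { 1, 2, 1, 2 };           /* 1-based column indices  */
  int          ia[] = { 1, 3, 5 };              /* 1-based row pointers    */
  const double x[]  = { 1.0, 1.0 };
  double       b[2];

  foo(a, x, b, ja, ia, 2);
  printf("b = [ %g, %g ]\n", b[0], b[1]);       /* expect [ 3, 7 ] */
  return 0;
}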
When I compile this code with Intel C compiler 16.0.1.150 using "icc -qopenmp -qopt-report=5 foo1.c", the inner loop on line 9 gets vectorized:
Intel(R) Advisor can now assist with vectorization and show optimization
report messages with your source code.
See "https://software.intel.com/en-us/intel-advisor-xe" for details.

    Report from: Interprocedural optimizations [ipo]

 WHOLE PROGRAM (SAFE) [EITHER METHOD]: false
 WHOLE PROGRAM (SEEN) [TABLE METHOD]: false
 WHOLE PROGRAM (READ) [OBJECT READER METHOD]: false

INLINING OPTION VALUES:
  -inline-factor: 100
  -inline-min-size: 30
  -inline-max-size: 230
  -inline-max-total-size: 2000
  -inline-max-per-routine: 10000
  -inline-max-per-compile: 500000

In the inlining report below:
   "sz" refers to the "size" of the routine. The smaller a routine's size,
      the more likely it is to be inlined.
   "isz" refers to the "inlined size" of the routine. This is the amount
      the calling routine will grow if the called routine is inlined into it.

Begin optimization report for: foo(const double *const, const double *const, double *const, const int *const, int *const, const int)

    Report from: Interprocedural optimizations [ipo]

INLINE REPORT: (foo(const double *const, const double *const, double *const, const int *const, int *const, const int)) [1/1=100.0%] src/foo1.c(1,127)

    Report from: Loop nest, Vector & Auto-parallelization optimizations [loop, vec, par]

LOOP BEGIN at src/foo1.c(5,3)
   remark #15542: loop was not vectorized: inner loop was already vectorized

   LOOP BEGIN at src/foo1.c(9,5)
      remark #15389: vectorization support: reference a has unaligned access   [ src/foo1.c(11,7) ]
      remark #15381: vectorization support: unaligned access used inside loop body
      remark #15305: vectorization support: vector length 2
      remark #15399: vectorization support: unroll factor set to 4
      remark #15309: vectorization support: normalized vectorization overhead 0.149
      remark #15300: LOOP WAS VECTORIZED
      remark #15450: unmasked unaligned unit stride loads: 1
      remark #15458: masked indexed (or gather) loads: 2
      remark #15460: masked strided loads: 4
      remark #15475: --- begin vector loop cost summary ---
      remark #15476: scalar loop cost: 29
      remark #15477: vector loop cost: 18.500
      remark #15478: estimated potential speedup: 1.540
      remark #15488: --- end vector loop cost summary ---
   LOOP END

   LOOP BEGIN at src/foo1.c(9,5)
   <Remainder loop for vectorization>
      remark #15389: vectorization support: reference a has unaligned access   [ src/foo1.c(11,7) ]
      remark #15381: vectorization support: unaligned access used inside loop body
      remark #15305: vectorization support: vector length 2
      remark #15309: vectorization support: normalized vectorization overhead 0.780
      remark #15301: REMAINDER LOOP WAS VECTORIZED
   LOOP END

   LOOP BEGIN at src/foo1.c(9,5)
   <Remainder loop for vectorization>
   LOOP END
LOOP END
===========================================================================
After that, I insert "#pragma omp parallel for" before the outer loop:
void foo (const double * const a, const double * const x, double * const b, const int * const ja, int *const ia, const int n) {
  int i;
  const int bl = 2;
#pragma omp parallel for
  for (i = 0; i < n; i++) {
    int l = (ia[i+1] - ia[i]);
    int c = ia[i] - 1;
    double s = 0.0;
    while ( l ) {
      s += a[ c     ] * x[ ja[c    ] - 1 ];
      s += a[ c + 1 ] * x[ ja[c + 1] - 1 ];
      l -= bl;
      c += bl;
    }
    b[i] = s;
  }
}
With this change, the loop on line 9 is no longer vectorized:
Intel(R) Advisor can now assist with vectorization and show optimization
report messages with your source code.
See "https://software.intel.com/en-us/intel-advisor-xe" for details.

    Report from: Interprocedural optimizations [ipo]

 WHOLE PROGRAM (SAFE) [EITHER METHOD]: false
 WHOLE PROGRAM (SEEN) [TABLE METHOD]: false
 WHOLE PROGRAM (READ) [OBJECT READER METHOD]: false

INLINING OPTION VALUES:
  -inline-factor: 100
  -inline-min-size: 30
  -inline-max-size: 230
  -inline-max-total-size: 2000
  -inline-max-per-routine: 10000
  -inline-max-per-compile: 500000

In the inlining report below:
   "sz" refers to the "size" of the routine. The smaller a routine's size,
      the more likely it is to be inlined.
   "isz" refers to the "inlined size" of the routine. This is the amount
      the calling routine will grow if the called routine is inlined into it.

Begin optimization report for: foo(const double *const, const double *const, double *const, const int *const, int *const, const int)

    Report from: Interprocedural optimizations [ipo]

INLINE REPORT: (foo(const double *const, const double *const, double *const, const int *const, int *const, const int)) [1/1=100.0%] src/foo2.c(1,127)

    Report from: OpenMP optimizations [openmp]

src/foo2.c(4:1-4:1):OMP:foo: OpenMP DEFINED LOOP WAS PARALLELIZED

    Report from: Loop nest, Vector & Auto-parallelization optimizations [loop, vec, par]

LOOP BEGIN at src/foo2.c(5,3)
   remark #15541: outer loop was not auto-vectorized: consider using SIMD directive

   LOOP BEGIN at src/foo2.c(9,5)
      remark #15523: loop was not vectorized: loop control variable c was found, but loop iteration count cannot be computed before executing the loop
   LOOP END
LOOP END
===========================================================================
My question is: can someone from Intel help me understand WHY the inner loop (line 9) is vectorized in my foo1, but not in foo2?
I have a humble request: please answer the "why" question rather than divert the discussion into how to write more vectorizable code than the snippets above. My goal is not to optimize performance, but rather to write a routine that reproduces bit-for-bit the results of a legacy routine in which this loop is NOT vectorized. For that purpose, I would like the loop on line 9 NOT to be vectorized in my new code, because otherwise the vector reduction changes the order of operations, which slightly changes the numerical results (there is a small illustration of this reassociation effect at the end of this post). From the tests shown above, it looks like I can disable vectorization in one of two ways:
- "Parallelize" the i-loop, possibly with 1 thread, so that the compiler gives up on vectorization of the inner loop, but I don't know if this workaround will survive until the next compiler version.
- Put "pragma novector" before the j-loop, which does prevent vectorization, but does not recover bit-to-bit the results of the original code (possibly because the original code has some automatic unrolling).
Understanding why the compiler vectorizes the code in foo1 but not in foo2 will help me set up a strategy for writing code that produces results bitwise identical to the legacy code.
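To make concrete why I care about the order of operations: reassociating a floating-point sum, which is effectively what a vectorized reduction does, can change the last bits of the result. A tiny standalone illustration with contrived values (not from my actual data):

#include <stdio.h>

int main(void) {
  /* Contrived values, chosen only to make the effect visible. */
  double p = 1.0e16, q = -1.0e16, r = 1.0;

  double scalar_order  = (p + q) + r;   /* left-to-right, as in the scalar loop */
  double reassoc_order = p + (q + r);   /* a different association of the sum   */

  printf("%.17g\n", scalar_order);      /* prints 1 on an IEEE 754 machine */
  printf("%.17g\n", reassoc_order);     /* prints 0: r is absorbed into q  */
  return 0;
}

On an IEEE 754 machine this prints 1 and then 0, which is why even a simple sum like the one in the legacy loop cannot be reassociated if I want identical bits.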