Hi there,
I am compiling the code below with ICC using the following command line:
icc -w -par-threshold0 -no-vec -fno-inline -parallel -qopt-report-phase=all -qopt-report=5
 1 #include <stdio.h>
 2 int a[100];
 3
 4 int main(int argc, char *argv[])
 5 {
 6   int len=argc;
 7   int i,x=10;
 8
 9   for (i=0;i<len;i++)
10   {
11     a[x] = i;
12     x=i;
13   }
14
15   for (i = 0; i < len; i++)
16     printf("%d ", a[i]);
17   printf("x=%d",x);
18   return 0;
19 }
The code is a modification of the following program in AutoParBench:
https://github.com/LLNL/dataracebench/blob/master/micro-benchmarks/DRB016-outputdep-orig-yes.c
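The loop pattern in the code has two loop-carried dependences, both on x:
1. A loop-carried output dependence, because every iteration writes x:
x = ..;
2. A loop-carried true dependence, because each iteration reads x (as the index in a[x]) before overwriting it, so iteration i+1 reads the value written by iteration i:
.. = x; // a[x]
x = ..;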
Below is the report produced by ICC. According to remark #17109, ICC auto-parallelized the loop at lines 9-13.
Intel(R) Advisor can now assist with vectorization and show optimization report messages with your source code.
See "https://software.intel.com/en-us/intel-advisor-xe" for details.

Intel(R) C Intel(R) 64 Compiler for applications running on Intel(R) 64, Version 19.0.4.243 Build 20190416

Compiler options: -par-threshold0 -no-vec -fno-inline -parallel -qopt-report-phase=all -qopt-report=5 -o test.out

Report from: Interprocedural optimizations [ipo]

WHOLE PROGRAM (SAFE) [EITHER METHOD]: false
WHOLE PROGRAM (SEEN) [TABLE METHOD]: true
WHOLE PROGRAM (READ) [OBJECT READER METHOD]: false

INLINING OPTION VALUES:
  -inline-factor: 100
  -inline-min-size: 30
  -inline-max-size: 230
  -inline-max-total-size: 2000
  -inline-max-per-routine: 10000
  -inline-max-per-compile: 500000

In the inlining report below:
  "sz" refers to the "size" of the routine. The smaller a routine's size, the more likely it is to be inlined.
  "isz" refers to the "inlined size" of the routine. This is the amount the calling routine will grow if the called routine is inlined into it. The compiler generally limits the amount a routine can grow by having routines inlined into it.

Begin optimization report for: main(int, char **)

Report from: Interprocedural optimizations [ipo]

INLINE REPORT: (main(int, char **)) [1/1=100.0%] modified_clean_DRB016-outputdep-orig-yes.c(5,1)
  -> EXTERN: (16,5) printf(const char *__restrict__, ...)
  -> EXTERN: (17,3) printf(const char *__restrict__, ...)

Report from: Loop nest, Vector & Auto-parallelization optimizations [loop, vec, par]

LOOP BEGIN at modified_clean_DRB016-outputdep-orig-yes.c(9,3)
  remark #17109: LOOP WAS AUTO-PARALLELIZED
  remark #17101: parallel loop shared={ .2 } private={ } firstprivate={ argc } lastprivate={ } firstlastprivate={ i } reduction={ }
  remark #15540: loop was not vectorized: auto-vectorization is disabled with -no-vec flag
  remark #25439: unrolled with remainder by 2
  remark #25456: Number of Array Refs Scalar Replaced In Loop: 1
  remark #25015: Estimate of max trip count of loop=100
LOOP END

LOOP BEGIN at modified_clean_DRB016-outputdep-orig-yes.c(9,3)
<Remainder>
  remark #25456: Number of Array Refs Scalar Replaced In Loop: 1
  remark #25015: Estimate of max trip count of loop=100
LOOP END

LOOP BEGIN at modified_clean_DRB016-outputdep-orig-yes.c(15,3)
  remark #17104: loop was not parallelized: existence of parallel dependence
  remark #15382: vectorization support: call to function printf(const char *__restrict__, ...) cannot be vectorized [ modified_clean_DRB016-outputdep-orig-yes.c(16,5) ]
  remark #15344: loop was not vectorized: vector dependence prevents vectorization
  remark #25015: Estimate of max trip count of loop=100
LOOP END

LOOP BEGIN at modified_clean_DRB016-outputdep-orig-yes.c(9,3)
  remark #15540: loop was not vectorized: auto-vectorization is disabled with -no-vec flag
  remark #25439: unrolled with remainder by 2
  remark #25456: Number of Array Refs Scalar Replaced In Loop: 1
  remark #25015: Estimate of max trip count of loop=100
LOOP END

LOOP BEGIN at modified_clean_DRB016-outputdep-orig-yes.c(9,3)
<Remainder>
  remark #25456: Number of Array Refs Scalar Replaced In Loop: 1
  remark #25015: Estimate of max trip count of loop=100
LOOP END

Report from: Code generation optimizations [cg]

modified_clean_DRB016-outputdep-orig-yes.c(5,1):remark #34051: REGISTER ALLOCATION : [main] modified_clean_DRB016-outputdep-orig-yes.c:5

  Hardware registers
    Reserved : 2[ rsp rip]
    Available : 39[ rax rdx rcx rbx rbp rsi rdi r8-r15 mm0-mm7 zmm0-zmm15]
    Callee-save : 6[ rbx rbp r12-r15]
    Assigned : 14[ rax rdx rcx rbx rsi rdi r8-r15]

  Routine temporaries
    Total : 125
      Global : 33
      Local : 92
    Regenerable : 46
    Spilled : 1

  Routine stack
    Variables : 32 bytes*
      Reads : 6 [0.00e+00 ~ 0.0%]
      Writes : 9 [0.00e+00 ~ 0.0%]
    Spills : 48 bytes*
      Reads : 11 [5.00e+00 ~ 0.6%]
      Writes : 11 [0.00e+00 ~ 0.0%]

  Notes

    *Non-overlapping variables and spills may share stack space, so the total stack size might be less than this.

===========================================================================
However, Intel Inspector reports a data race in the loop parallelized by ICC. The contents of “log/realtime_mode.log”, generated by Intel Inspector, follow below.
<?xml version="1.0" encoding="UTF-8"?>
<feedback>
  <message severity="verbose">Analysis started...</message> <nop/>
  <message severity="info">Collection started. To stop the collection, either press CTRL-C or enter from another console window: inspxe-cl -r /home/gleison/Desktop/Fernando_modifed_example/r005ti3 -command stop.</message> <nop/>
  <message severity="verbose">Result file: /home/gleison/Desktop/Fernando_modifed_example/r005ti3/r005ti3.inspxe </message> <nop/>
  <message severity="verbose">Found target process /home/gleison/Desktop/Fernando_modifed_example/test.out (PID = 20895). Analysis started... </message> <nop/>
  <message severity="verbose">Loaded module: /home/gleison/Desktop/Fernando_modifed_example/test.out. </message> <nop/>
  <message severity="verbose">Loaded module: /lib64/ld-linux-x86-64.so.2. </message> <nop/>
  <message severity="verbose">Loaded module: [vdso]. </message> <nop/>
  <message severity="verbose">Loaded module: /lib/x86_64-linux-gnu/libm.so.6. </message> <nop/>
  <message severity="verbose">Loaded module: /usr/lib/x86_64-linux-gnu/libiomp5.so. </message> <nop/>
  <message severity="verbose">Loaded module: /lib/x86_64-linux-gnu/libgcc_s.so.1. </message> <nop/>
  <message severity="verbose">Loaded module: /lib/x86_64-linux-gnu/libpthread.so.0. </message> <nop/>
  <message severity="verbose">Loaded module: /lib/x86_64-linux-gnu/libc.so.6. </message> <nop/>
  <message severity="verbose">Loaded module: /lib/x86_64-linux-gnu/libdl.so.2. </message> <nop/>
  <message severity="verbose">Loaded module: /opt/intel/inspector_2019.4.0.597413/lib64/runtime/libittnotify.so. </message> <nop/>
  <message severity="warning">One or more threads in the application accessed the stack of another thread. This may indicate one or more bugs in your application. Setting the Inspector to detect data races on stack accesses and running another analysis may help you locate these and other bugs.</message> <nop/>
  <message severity="verbose">Unloaded module: /home/gleison/Desktop/Fernando_modifed_example/test.out. </message> <nop/>
  <message severity="verbose">Unloaded module: /lib64/ld-linux-x86-64.so.2. </message> <nop/>
  <message severity="verbose">Unloaded module: [vdso]. </message> <nop/>
  <message severity="verbose">Unloaded module: /lib/x86_64-linux-gnu/libm.so.6. </message> <nop/>
  <message severity="verbose">Unloaded module: /usr/lib/x86_64-linux-gnu/libiomp5.so. </message> <nop/>
  <message severity="verbose">Unloaded module: /lib/x86_64-linux-gnu/libgcc_s.so.1. </message> <nop/>
  <message severity="verbose">Unloaded module: /lib/x86_64-linux-gnu/libpthread.so.0. </message> <nop/>
  <message severity="verbose">Unloaded module: /lib/x86_64-linux-gnu/libc.so.6. </message> <nop/>
  <message severity="verbose">Unloaded module: /lib/x86_64-linux-gnu/libdl.so.2. </message> <nop/>
  <message severity="verbose">Unloaded module: /opt/intel/inspector_2019.4.0.597413/lib64/runtime/libittnotify.so. </message> <nop/>
  <message severity="verbose">Process /home/gleison/Desktop/Fernando_modifed_example/test.out (PID = 20895) has terminated. </message> <nop/>
  <message severity="verbose">Application exit code: 0 </message> <nop/>
  <message severity="verbose">Result file: /home/gleison/Desktop/Fernando_modifed_example/r005ti3/r005ti3.inspxe </message> <nop/>
  <message severity="verbose">Analysis completed</message> <nop/>
  <message severity="info"> </message> <nop/>
  <message severity="info">1 new problem(s) found </message> <nop/>
  <message severity="info"> 1 Data race problem(s) detected </message> <nop/>
</feedback>
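The loop has a race condition on a[10]. When len > 11, two iterations write to it: the first iteration (i = 0, where x still holds its initial value 10) stores 0, and the twelfth iteration (i = 11, where x holds 10 from the previous iteration) stores 11. If the loop runs in parallel, a[10] can therefore end up as either 0 or 11.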
Regards,
Gleison