Channel: Intel® Software - Intel® C++ Compiler

Calling convention (codegen) bug when returning packed struct


Hello

I recently came across what seems to be a bug when compiling the following code (designed to reproduce the bug):

intel_test.hpp:

#pragma once

#include <cstdint>

#pragma pack(push, 1)
struct packed_struct {
    std::uint32_t uint_val;
    std::uint8_t byte_val;
};
#pragma pack(pop)

packed_struct create_packed_struct(std::uint8_t byte_val);

intel_test.cpp:

#include "intel_test.hpp"

packed_struct create_packed_struct(std::uint8_t byte_val) {
    packed_struct result;
    result.uint_val = 0xbaadf00d;
    result.byte_val = byte_val;
    return result;
}

main.cpp:

#include <iostream>

#include "intel_test.hpp"

int main() {
    const auto packed_struct_val = create_packed_struct(0);
    std::cout << "packed struct: uint_val = "<< std::hex << packed_struct_val.uint_val << ", byte_val = "<< int{packed_struct_val.byte_val} << std::endl;
}

This code should result in the output:
"packed struct: uint_val = baadf00d, byte_val = 0"
which is what happens when compiling with Visual Studio 2015, with Intel Compiler 15 in release mode, or with Intel Compiler 17 in debug mode.

However, when compiling with Intel Compiler 17 in release mode I get the following output:
"packed struct: uint_val = 3fb636a4, byte_val = 1".

When I viewed the disassembly I found that create_packed_struct() returns the whole struct packed into the RAX register:

mov         rax,0BAADF00Dh
movzx       r8d,dl
shl         r8,20h
or          rax,r8
ret

while the calling code expects the result to be written to memory on the stack pointed to by the RCX register:

xor         edx,edx
lea         rcx,[rbp+10h]
call        create_packed_struct (013F4B1000h)
mov         dl,byte ptr [rbp+14h]
mov         eax,dword ptr [rbp+10h]
mov         byte ptr [rbp+24h],dl  

And since it overwrites RAX after create_packed_struct() returns, the result is always whatever garbage was previously at RBP+10h.

Further testing showed that removing the "#pragma pack" directives fixes the problem (the caller correctly reads the result from RAX).
Adding the __regcall calling convention specifier to the declaration of create_packed_struct() also fixes the problem (the caller correctly reads the result from RAX).
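
For reference, the workaround declaration looks roughly like this (a sketch of what is described above; __regcall is an Intel-specific calling convention):

// Sketch of the __regcall workaround described above, applied to the
// declaration in intel_test.hpp (both translation units include it, so the
// caller and callee agree on the convention):
__regcall packed_struct create_packed_struct(std::uint8_t byte_val);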

For reference:

release compiler flags: /MP /GS /Zc:rvalueCast /W4 /QxCORE-AVX2 /Gy /Zc:wchar_t /Zi /O2 /Ob2 /GF /Zc:forScope /GR /arch:CORE-AVX2 /Oi /MD /EHsc /nologo /Gw /Zo /Qstd=c++14 /Qvc14
debug compiler flags: /MP /GS /Zc:rvalueCast /W4 /Gy /Zc:wchar_t /Zi /Od /Zc:forScope /RTC1 /GR /MDd /EHsc /nologo /Gw /Zo /Qstd=c++14 /Qvc14
linker flags: /MANIFEST /NXCOMPAT /DYNAMICBASE /DEBUG /MACHINE:X64 /OPT:REF /qnoipo /INCREMENTAL:NO /SUBSYSTEM:CONSOLE /OPT:ICF /NOLOGO /TLBID:1

 

Thread Topic: Bug Report

Apparent IPO compiler bug


I just encountered a very strange compiler (or linker?) bug and thought I should report it here. What follows is my best guess at what is happening. This is in icl 15.0.2.179 Build 20150121.

I use a string - "Stream (via VLC)" - at multiple places. So in one place I used:
strcpy(some_variable, "Stream (via VLC)");

In another I now added:
strstr(some_other_variable, "Stream (via VLC)");

For the strcpy, the compiler sees that it can perform the full strcpy with just 2 SSE instructions (it's a 16 character/byte string). So, it loads the string into an SSE2 register and then stores it. This however requires that the string is aligned on a 16 byte address, and until now it always was.
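
A rough illustration of that in intrinsics form (a sketch only, not the compiler's actual output):

#include <emmintrin.h>

/* Sketch: the optimized strcpy boils down to a 16-byte SSE load of the string
 * data plus a store. The load is the aligned form (movdqa), so it faults if
 * the string literal it reads from is not on a 16-byte boundary. */
static void copy_tag(char *dst, const char *tag /* 16 chars + '\0' */)
{
    __m128i v = _mm_load_si128((const __m128i *)tag);  /* requires 16-byte alignment */
    _mm_storeu_si128((__m128i *)dst, v);
    dst[16] = '\0';
}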

But now I added that strstr line, which created another copy of this string, one that didn't have to be 16-byte aligned, and it wasn't.

So far so good. But in a final optimization step, the compiler checks whether strings occur multiple times, and if so, it throws all of them away but one. The one it kept wasn't 16-byte aligned, which caused the code to crash in release mode (and the cause was only visible by looking at the generated assembly code).

I have a workaround (don't use a string literal in one of the two places, but set the array values separately), but this is something that could really drive people insane if they don't understand how SSE works and can't read disassembly.

ICC 17.0.2 rejects valid C++ template function specialization


The following code is rejected with the message
test.cpp(19): error: declaration is incompatible with function template "void Foo::bar(Vec<, T>) [with N=0]" (declared at line 11) void Foo<0>::bar(Vec<0, T> x) {}

Compiled with: icpc -std=c++14 test.cpp
icpc --version
icpc (ICC) 17.0.2 20170213
Copyright (C) 1985-2017 Intel Corporation. All rights reserved.

constexpr int getN(int n) {
    return n;
};

template<int N, typename T>
struct Vec{ };

template<int N>
struct Foo
{
  template<typename T> void bar(Vec<getN(N), T> x);

  //// A workaround
  // static constexpr int n = getN(N);
  // template<typename T> void bar(Vec<n, T> x);
};

template<> template <typename T>
void Foo<0>::bar(Vec<0, T> x) {}

Thanks,
Matthias Hochsteger

 

Intel C++ compiler for academic non-commercial use


Hello,

About three years ago I could download Intel Composer XE 2013 for free for academic non-commercial use. Now I see that only some libraries (MKL, etc.) are free, but I have read in several places that the C++ compiler could become free for academia in the future. Is that true?

Code optimization fails


Hi everyone,

I recently ran into an issue I don't understand. I wrote an iterative flow solver that carries out some calculations in each iteration, then calculates a residual and starts over if the residual is still too large. I moved the calculation part into a function that is called in each iteration. It looks like this:

for (int j = 1; j < (ny - 1); j++)
	{
		for (int i = boundLeft; i < (bruttoLength-boundRight); i++)
		{
			i0 = i + j * bruttoLength;

			utilde[i0] = 0;

			// differences in x-direction

			if (i == 1) {
				utilde[i0] = utilde[i0] - (-v / (12 * a * dx) - 1 / (12 * pow(dx, 2))) * u[i0 + 3];

				utilde[i0] = utilde[i0] - (v / (2 * a * dx) + 1 / (3 * pow(dx, 2))) * u[i0 + 2];

				utilde[i0] = utilde[i0] - (-3 * v / (2 * a*dx) + 1 / (2 * pow(dx, 2))) * u[i0 + 1];

				utilde[i0] = utilde[i0] - (v / (4 * a*dx) + 11 / (12 * pow(dx, 2))) * utilde[i0 - 1];

				own_share = own_share + (5 * v / (6 * a*dx) - 5 / (3 * pow(dx, 2)));
			}
			else
			{
				if (i == (bruttoLength - 2)) {

					utilde[i0] = utilde[i0] - (v / (12 * a * dx) - 1 / (12 * pow(dx, 2))) * utilde[i0 - 3];

					utilde[i0] = utilde[i0] - (-v / (2 * a * dx) + 1 / (3 * pow(dx, 2))) * utilde[i0 - 2];

					utilde[i0] = utilde[i0] - (3 * v / (2 * a * dx) + 1 / (2 * pow(dx, 2))) * utilde[i0 - 1];

					utilde[i0] = utilde[i0] - (-v / (4 * a * dx) + 11 / (12 * pow(dx, 2))) * u[i0 + 1];

					own_share = own_share + (-5 * v / (6 * a*dx) - 5 / (3 * pow(dx, 2)));
				}
				else
				{
					utilde[i0] = utilde[i0] - (v / (12 * a * dx) - 1 / (12 * pow(dx, 2))) * u[i0 + 2];

					utilde[i0] = utilde[i0] - (-2 * v / (3 * a * dx) + 4 / (3 * pow(dx, 2))) * u[i0 + 1];

					utilde[i0] = utilde[i0] - (2 * v / (3 * a * dx) + 4 / (3 * pow(dx, 2))) * utilde[i0 - 1];

					utilde[i0] = utilde[i0] - (-v / (12 * a * dx) - 1 / (12 * pow(dx, 2))) * utilde[i0 - 2];

					own_share = own_share + (-5 / (2 * pow(dx, 2)));
				}
			}

			// repeat equivalent code for y-direction. It has the same structure as above

			// SOR-Share
			utilde[i0] = utilde[i0] / own_share * w + (1 - w) * u[i0];
			own_share = 0;
		}
	}

As long as the code is executed as a function call (arguments are 3 integers and 2 double pointers) the performance is really bad. As soon as I copy the code directly into my loop there is a massive speed up.

I tried both versions with and without code optimization enabled (/O2) and measured the average execution time of the code snippet above. It looks like there is only minor optimization for the version with the function call, as its execution time did not improve much (3x faster, compared to 12x faster without the function call).

I'm not sure if this is the root of the problem, though. Can anybody give me some advice? Of course I could leave the whole calculation part inside my while loop, but that looks very confusing. It would be much clearer to move it into a separate function.
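
For reference, a minimal sketch of the structure I mean (the names and the loop body here are invented placeholders, not the actual solver), with the per-iteration work in its own function called from the outer loop:

// Invented example (not the solver above): if the definition is visible to
// the caller (same translation unit, or /Qipo across files), the optimizer
// can inline the call, so the function version behaves like the pasted-in code.
inline void axpyRow(int n, double a, const double* x, double* y)
{
    for (int i = 0; i < n; ++i)
        y[i] += a * x[i];
}

void solverIteration(int n, double a, const double* x, double* y)
{
    axpyRow(n, a, x, y);   // candidate for inlining under /O2
}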

I'm using the compiler that comes with Intel Parallel Studio XE 2017.

Best regards.

Thread Topic: Question

__builtin_clrsbl undefined


__builtin_clrsb and __builtin_clrsbll exist, but __builtin_clrsbl seems to be missing.  Quick test:

#include <stdlib.h>
#include <stdio.h>

int main (void) {
  printf("int: %d\n", __builtin_clrsb(-1));
  printf("long: %d\n", __builtin_clrsbl(-1));
  printf("long long: %d\n", __builtin_clrsbll(-1));

  return EXIT_SUCCESS;
}

When attempting to compile:

nemequ@peltast:~/t$ icc -o clrsb clrsb.c
clrsb.c(6): warning #266: function "__builtin_clrsbl" declared implicitly
    printf("long: %d\n", __builtin_clrsbl(-1));
                         ^

/tmp/icc9KCY5p.o: In function `main':
clrsb.c:(.text+0x43): undefined reference to `__builtin_clrsbl'
nemequ@peltast:~/t$ icc --version
icc (ICC) 17.0.2 20170213
Copyright (C) 1985-2017 Intel Corporation.  All rights reserved.
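
A possible stopgap (a sketch, assuming an LP64 target where long and long long are both 64 bits; the macro name is made up) is to route the long variant to the long long builtin, which icc does provide:

/* Sketch of a workaround: on LP64, long and long long have the same width,
 * so the ll builtin gives the same answer for long arguments. */
#if defined(__INTEL_COMPILER)
#  define my_clrsbl(x) __builtin_clrsbll((long long)(x))
#else
#  define my_clrsbl(x) __builtin_clrsbl(x)
#endif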

Thread Topic: Bug Report

__STDC_NO_THREADS__ undefined


ICC defines __STDC_VERSION__ to 201112L, but doesn't define __STDC_NO_THREADS__.  glibc doesn't currently support the C11 threads API, so it should be defined (per § 6.10.8.3 of the C11 spec).

I know this is partially a libc problem.  I believe GCC resolves this by including <stdc-predef.h> from glibc (which defines __STDC_NO_THREADS__ in sufficiently recent glibc versions).

I'm working around this right now by checking __STDC_NO_THREADS__ after including <limits.h> (<limits.h> includes <features.h> which includes <stdc-predef.h>).
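
In code form that workaround looks roughly like this (a sketch of what is described above):

/* Pull in <stdc-predef.h> via glibc's <limits.h> so __STDC_NO_THREADS__ is
 * visible before the feature test below. */
#include <limits.h>

#if defined(__STDC_VERSION__) && (__STDC_VERSION__ >= 201112L) && \
    !defined(__STDC_NO_THREADS__)
#  include <threads.h>
#endif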

Testing is simple, just put this before any includes:

#if defined(__STDC_VERSION__) && (__STDC_VERSION__ >= 201112L) && !defined(__STDC_NO_THREADS__)
#  include <threads.h>
#endif

 

Thread Topic: Bug Report

How to specify __m512 to co-reside with __m256 or __m128


Working from the intrinsics guide, for use with AVX-512, one of the intrinsics I want to use is

__m512d _mm512_broadcastsd_pd(__m128d a)

However, I'd like to avoid using a mov to move data from register to register; I'd rather use a cast.
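
For what it's worth, the cast intrinsics express exactly that reinterpretation without a mov (a minimal sketch, assuming AVX-512F; which physical zmm register the value lands in is still up to the register allocator, so this does not pin it to zmm0:zmm7):

#include <immintrin.h>

/* Sketch: _mm512_castpd128_pd512() reinterprets an __m128d as the low 128 bits
 * of an __m512d and emits no instruction; the upper 384 bits are undefined. */
__m512d widen_without_move(__m128d a)
{
    return _mm512_castpd128_pd512(a);
}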

The __m128 registers are co-resident with the low 8 __m512 registers. So, what I am asking for is

__m512d foo compilerDirective(use a register in range of 0:7);

Or a way to explicitly specify which zmm register to use (in the range 0:7). (I'd rather not regress to assembler.)

There would be a similar issue with specifying zmm to overlay with ymm.

Jim Dempsey


omp_get_num_procs() doesn't return all processors

$
0
0

Hi, I have an application in C++ that uses Qt libraries.

When I call omp_get_num_procs() from any part of the program, it doesn't return the maximum number of processors that my machine has, so the threads are distributed over only some of the available processors.

But if I call omp_get_num_procs() from main.cpp, before the QApplication constructor, I get all the processors, and the threads are distributed across all the processors the machine has.

I've been trying to find out what exactly is changing the processors available to the application as reported by omp_get_num_procs().
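
A minimal sketch of the observation described above (assumes an OpenMP-enabled build and Qt; the printed labels are illustrative only):

#include <omp.h>
#include <QApplication>
#include <cstdio>

int main(int argc, char *argv[])
{
    // Before the QApplication constructor this reports all processors.
    std::printf("before QApplication: %d\n", omp_get_num_procs());

    QApplication app(argc, argv);

    // Afterwards, omp_get_num_procs() reports fewer processors on my machine.
    std::printf("after QApplication:  %d\n", omp_get_num_procs());
    return 0;
}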

Best Regards,

Leonardo

Thread Topic: Question

Linking issue with Intel Debug build ...


Hello,

I get the following linking error when building a debug build with Intel 16.0 compiler:

 error LNK2019: unresolved external symbol ___intel_ssse3_strncpy referenced in function

Does anyone know the corresponding lib where this symbol resides?  

Many thanks,

Andrew.

PS: Using Windows with VS2013 and Intel Compiler 16.0.

icpc HANGS on small program


Hi!

I found that icpc from Intel Parallel Studio XE 2017 hangs on my Linux machine while compiling my project. After some attempts, I reduced my code to the following:

#include <list>

class Outer {
public:
	class InnerBase {
	public:
	};

	class InnerDerrive : public InnerBase {} ;

	struct Item {
		InnerBase* x;
	};

private:
	InnerDerrive obj;
	using List = std::list<Item>;
	List saversMap{ { Item{&obj } } };

public:
};

compiling with /opt/intel/bin/icpc -std=c++14 -c file.cpp

If I replace InnerBase with InnerDerrive at line 12, the compiler works properly.

Please tell me: is this an Intel Compiler bug, and if so, how do I report it to Intel?

_GFX_offload weird behaviour


Hi,

I'm targeting Intel Graphics Technology with API-based offloading, for asynchronous offloading. To begin, I'm trying to offload this algorithm:

for (int i = 0; i < size; i++){
  A[i] = i;
}

So I wrote this code:

#include <stdlib.h>      /* malloc, free */
#include <gfx/gfx_rt.h>  /* _GFX_* runtime API */

__declspec(target(gfx_kernel))
void fill(int * A, int size){
  _Cilk_for(int i = 0; i < size; i++){
    A[i] = i;
  }
}

int main() {
  int N = 1024;
  int * A = malloc(sizeof(int) * N);

  _GFX_share(A,N);

  _GFX_offload((void*)fill, A, N);
  _GFX_wait(0,-1);

  _GFX_unshare(A);
  free(A);

  return 0;
}

This code compiles and executes, but only the first 780 elements of A are actually changed. I guess that's because of the maximum number of groups and threads, but the number seems weird to me (_GFX_get_device_hardware_thread_count() returns 336).

So I have two questions: why 780? And how can I write a kernel that I can call with

_GFX_offload((void *)fill, A, N);

that does what I want it to do?

Thanks, and have a nice day

Mathieu

Thread Topic: Help Me

Compiler doesn't vectorize even with simd directive


I have this function taken from [here][1]:

    bool interpolate(const Mat &im, float ofsx, float ofsy, float a11, float a12, float a21, float a22, Mat &res)
    {
       bool ret = false;
       // input size (-1 for the safe bilinear interpolation)
       const int width = im.cols-1;
       const int height = im.rows-1;
       // output size
       const int halfWidth  = res.cols >> 1;
       const int halfHeight = res.rows >> 1;
       float *out = res.ptr<float>(0);
       for (int j=-halfHeight; j<=halfHeight; ++j)
       {
          const float rx = ofsx + j * a12;
          const float ry = ofsy + j * a22;
          for(int i=-halfWidth; i<=halfWidth; ++i)
          {
             float wx = rx + i * a11;
             float wy = ry + i * a21;
             const int x = (int) floor(wx);
             const int y = (int) floor(wy);
             if (x >= 0 && y >= 0 && x < width && y < height)
             {
                // compute weights
                wx -= x; wy -= y;
                // bilinear interpolation
                *out++ =
                   (1.0f - wy) * ((1.0f - wx) * im.at<float>(y,x)   + wx * im.at<float>(y,x+1)) +
                   (       wy) * ((1.0f - wx) * im.at<float>(y+1,x) + wx * im.at<float>(y+1,x+1));
             } else {
                *out++ = 0;
                ret =  true; // touching boundary of the input
             }
          }
       }
       return ret;
    }

As suggested by [Intel Advisor][2], I added:

    #pragma omp simd
    for(int i=-halfWidth; i<=halfWidth; ++i)

However, while compiling I got:

    warning #15552: loop was not vectorized with "simd"

Googling it, I found [this][3], but it's still not clear to me how I could solve this and vectorize this loop. 

  [1]: https://github.com/perdoch/hesaff/blob/master/helpers.cpp
  [2]: https://www.google.it/url?sa=t&rct=j&q=&esrc=s&source=web&cd=1&cad=rja&u...
  [3]: https://software.intel.com/en-us/articles/fdiag13379
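
For what it's worth, one thing worth ruling out is the *out++ update, a loop-carried pointer bump written inside a branch. A minimal sketch of the index-based form (stripped of the OpenCV types and the boundary handling, so it is not the original function):

    // Simplified sketch: each iteration writes out[i + halfWidth] instead of
    // bumping a shared pointer, so iterations are independent and the simd
    // pragma has a chance to apply. The arithmetic is a placeholder, not the
    // bilinear interpolation above.
    void fillRow(float rx, float ry, float a11, float a21, int halfWidth, float *out)
    {
        #pragma omp simd
        for (int i = -halfWidth; i <= halfWidth; ++i)
        {
            const float wx = rx + i * a11;
            const float wy = ry + i * a21;
            out[i + halfWidth] = wx + wy;
        }
    }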

Problem integrate Intel C++ compiler with IDE


Hello,

I'm trying to integrate the Intel C++ compiler with either Xcode or Eclipse on macOS. However, the Intel C++ compiler does not show up in my Xcode build rules settings, nor does the Eclipse extension exist in my installation directory.

My macOS version is 10.12.4 and my Xcode version is 8.3. My installed Intel version is Parallel Studio Composer 2017.

Is there a compatibility problem between my Intel compiler and my system?

 

 

Thread Topic: Question

"typedef double v4df __attribute__((vector_size(32)));" with OpenMP makes strange output.


Hello,

"typedef double v4df __attribute__((vector_size(32)));" with OpenMP makes strange output.

Please decompress the attached file and run "make" and "make test"; then you will see the following.

"icpc & -DV4DF=auto", "g++ & -DV4DF=auto", and "g++ & -DV4DF=v4df" produce correct output.

On the other hand, "icpc & -DV4DF=v4df" produces wrong output.

Thank you

Attachment: test_program.tar.gz (application/x-gzip, 2.44 KB)

Thread Topic: Bug Report

Build fails using ICC but not GCC


Hi all,

I am building ITK from source. I have had the same issue on Centos 7 and Ubuntu 14.04 systems running on Xeon Broadwell systems.

Building with GCC

When I use CMake to configure a standard build using GCC, I use the method below. The compilation completes and works well.

mkdir build; cd build;
cmake -D Module_PerformanceBenchmarking:BOOL=ON ..
make 

Building with ICC (latest release 2017; update 2 16 Feb 2017)

When I use CMake to configure a build using the Intel compiler, I use the method below, but the compilation fails.

mkdir build; cd build;
CC=/opt/bin/icc CXX=/opt/bin/icc cmake -DCMAKE_CXX_COMPILER:FILEPATH=/opt/bin/icc -DCMAKE_C_COMPILER:FILEPATH=/opt/bin/icc -DModule_PerformanceBenchmarking:BOOL=ON ..
cmake -D Module_PerformanceBenchmarking:BOOL=ON ..
make 

On CentOS 7, I get this problem:

Scanning dependencies of target ITKTransform-all
[ 58%] Built target ITKTransform-all
[ 58%] Building CXX object Modules/Core/ImageFunction/test/CMakeFiles/ITKImageFunctionTestDriver.dir/itkRayCastInterpolateImageFunctionTest.cxx.o
/src/ITK/Modules/Core/ImageFunction/include/itkRayCastInterpolateImageFunction.hxx(38): error: member "<unnamed>::RayCastHelper<TInputImage, TCoordRep>::InputImageDimension [with TInputImage=itk::Image<unsigned char, 3U>, TCoordRep=double]" was referenced but not defined
    itkStaticConstMacro(InputImageDimension, unsigned int,
    ^

compilation aborted for /src/ITK/Modules/Core/ImageFunction/test/itkRayCastInterpolateImageFunctionTest.cxx (code 2)
make[2]: *** [Modules/Core/ImageFunction/test/CMakeFiles/ITKImageFunctionTestDriver.dir/itkRayCastInterpolateImageFunctionTest.cxx.o] Error 2
make[1]: *** [Modules/Core/ImageFunction/test/CMakeFiles/ITKImageFunctionTestDriver.dir/all] Error 2
make: *** [all] Error 2

I get this problem when I use Ubuntu 14.04:

Building CXX object Modules/Filtering/LabelMap/test/CMakeFiles/ITKLabelMapTestDriver.dir/itkBinaryImageToLabelMapFilterTest2.cxx.o
[ 68%] icc: error #10106: Fatal error in /opt/compilers_and_libraries_2017.2.174/linux/bin/intel64/mcpcom, terminated by kill signal
compilation aborted for /path/ITK/ITK-IntelCompiler/build/Modules/Segmentation/SignedDistanceFunction/test/ITKSignedDistanceFunctionTestDriver.cxx (code 1)
icc: error #10106: Fatal error in /opt/compilers_and_libraries_2017.2.174/linux/bin/intel64/mcpcom, terminated by kill signal
compilation aborted for /path/ITK/ITK-IntelCompiler/build/Modules/IO/BioRad/test/ITKIOBioRadTestDriver.cxx (code 1)
make[2]: *** [Modules/IO/XML/test/CMakeFiles/ITKIOXMLTestDriver.dir/ITKIOXMLTestDriver.cxx.o] Error 1

I need help getting past this issue. Thank you.

Thread Topic: Help Me

C++14 recursive lambda missing operator()

void foo() {
  auto bar = [](auto& self) -> void {
    return self(self);
  };
  bar(bar);
}

 

icc 17.0.2 produces the following error:

main.cc(3): error: call of an object of a class type without appropriate operator() or conversion functions to pointer-to-function type
      return self(self);
             ^
          detected during instantiation of function "lambda [](auto &)->void [with <auto-1>=lambda [](auto &)->void]" at line 5

 

Thread Topic: Bug Report

Using SVM for Intel Graphics Technology


Hi,

I'm writing a benchmark to compare different technologies and their performance across various platforms (on Linux). One of the platforms is an Intel Broadwell-H (Core i7-5775C with an integrated Iris Pro 6200 GPU), so I'm testing the various ways to offload code to the GPU using Cilk Plus. Right now I'm trying to use SVM, so I followed this tutorial, but I'm facing some problems. Here's my code:

#include <stdio.h>
#include <stdlib.h>
#include <assert.h>
#include <cilk/cilk.h>
#include <gfx/gfx_rt.h>

#define SIZE 64

int main(){
  int * in = (int*)_GFX_svm_alloc(sizeof(int)*SIZE);

#pragma offload target(gfx)
  _Cilk_for (int i = 0; i < SIZE; i++){
    in[i] = 1;
  }

  for (int i = 0; i < SIZE; i++){
    assert(in[i] == 1);
  }

  _GFX_svm_free(in);
  return 0;
}

Then I compile with

 - $ icc -qoffload-svm test.c
test.c(12): error: *GFX* pointer variable "in" in this offload region must be specified in an in/out/inout/nocpy clause
  #pragma offload target(gfx)
  ^

compilation aborted for test.c (code 2)

I thought maybe SVM is not allowed on all platforms, so I compiled with

- $ icc -qoffload-arch=broadwell -qoffload-svm test.c
- $ ./a.out
libva info: VA-API version 0.99.0
libva info: va_getDriverName() returns 0
libva info: User requested driver 'iHD'
libva info: Trying to open /opt/intel/mediasdk/lib64/iHD_drv_video.so
libva info: Found init function __vaDriverInit_0_32
libva info: va_openDriver() returns 0
a.out test.c:18: main: Assertion 'in[i] == 1' failed.
Abandon (core dumped)

So I guess specifying the platform helps it compile, but the execution fails.

If I change the pragma by adding inout(in : length(SIZE)), compilation and execution work with the first command line, but with the second one I get the same execution problem. The point is: I don't want to add the inout clause; I shouldn't have to. I assume my compilation line is wrong, but I can't say in what way.

So my question is: do you see something wrong in my code or compilation?

Thanks a lot for your time,

Mathieu

crash using GCC style vector types


I'm seeing a strange compiler crash when using GCC-style vector types. I'm using vector types for loads and stores because using _mm_loadu_si128 to load vectors seems to ignore the restrict qualifier, causing constant values to be reloaded from memory; I'll make another post for that.

My function calculates the mean and stdev of an array. The clever part is reusing the loop body to process the remainder, to reduce the cache footprint. But it seems it's this complex control flow that's causing the crash: if I comment out the goto handleRemainder, or comment out the innermost do-while loop, it compiles. And of course, using __m128i instead of vector types makes the crash go away.

The crash happens in both ICC 14 and the latest ICC 17. I'd appreciate a reasonable workaround, explanation, or patch.

#include <immintrin.h>
#include <stdint.h>
#include <unistd.h>
#include <math.h>
#include <algorithm>
using namespace std;

#define CAST_VSHORT(x) x
#define ROUND_DOWN(a, b) (a & (~(b - 1)))
#define MAX_INTENSITY 4096
#define FORCE_INLINE inline __attribute__ ((always_inline))

#if 1
// crashes
typedef int32_t __attribute__((vector_size(16))) VINT;
typedef int16_t __attribute__((vector_size(16))) VSHORT;
typedef int16_t __attribute__((vector_size(16), aligned(1))) UNALIGNED_VSHORT;
#else
typedef __m128i VINT;
typedef __m128i VSHORT;
#endif

FORCE_INLINE __m128i PartialVectorMask(ssize_t n)
{
  return _mm_set1_epi16(0xffff);   // incomplete for brevity
}

FORCE_INLINE int64_t VectorSum(VINT x)
{
  __m128i lo = _mm_cvtepi32_epi64(x),
          hi = _mm_cvtepi32_epi64(_mm_srli_si128(x, 8));
  __m128i sum = _mm_add_epi64(lo, hi);
  return _mm_extract_epi64(_mm_add_epi64(sum, _mm_srli_si128(sum, 8)), 0);
}

void CalculateMeanAndStdev(float &mean, float &stdev,
                           int16_t *in, ssize_t size)
{
  ssize_t i;
  double sum = 0, squareSum = 0;
  VINT zero = _mm_set1_epi32(0),
    vSquareSum = zero,
    vSum = zero;
    VSHORT data;
    ssize_t blockEnd;
    const ssize_t VECTOR_WIDTH = 8;
    // elements you can accumulate before square sum can overflow
    const ssize_t BLOCK_SIZE = ROUND_DOWN((UINT32_MAX / ((MAX_INTENSITY - 1) * (MAX_INTENSITY - 1))) * 4, VECTOR_WIDTH);
    ssize_t roundedSize = ROUND_DOWN(size, VECTOR_WIDTH);
    for (i = 0; i <= size - VECTOR_WIDTH; )
    {
      blockEnd = min(i + BLOCK_SIZE, roundedSize);
      // process a block whose size is a multiple of 8, except when processing the SIMD remainder
      do
      {
          data = _mm_loadu_si128((__m128i *)&in[i]);
          //data = *(UNALIGNED_VSHORT *)&in[i];
      handleRemainder:
          VINT unpacked0 = _mm_srai_epi32(_mm_unpacklo_epi16(data, data), 16),
               unpacked1 = _mm_srai_epi32(_mm_unpackhi_epi16(data, data), 16);

          vSquareSum = _mm_add_epi32(_mm_madd_epi16(data, data), vSquareSum);
          vSum = _mm_add_epi32(unpacked0, vSum);
          vSum = _mm_add_epi32(unpacked1, vSum);
          i += VECTOR_WIDTH;
      } while (i < blockEnd);

      squareSum += VectorSum(vSquareSum);
      sum += VectorSum(vSum);
      vSum = zero;
      vSquareSum = zero;
  }
  if (i < size)
  {
      // handle remainder by setting invalid elements to 0
      data = _mm_and_si128(_mm_loadu_si128((__m128i *)&in[i]), PartialVectorMask((size % VECTOR_WIDTH) * sizeof(int16_t)));
      blockEnd = size;
      goto handleRemainder;     // share code to reduce machine code size
  }
  mean = sum / size;
  stdev = sqrtf((squareSum - sum * sum / size) / (size - 1));
}

int main()
{
  const size_t N = 4096;
  int16_t __attribute__((aligned(16))) image[N];
  float mean, stdev;
  for (int i = 0; i < 1000000; ++i)
  {
    CalculateMeanAndStdev(mean, stdev, image, N);
  }
  return mean;
}

Thread Topic: Bug Report

Intel C++ compiler 2017 MPI 64-bit C++ support


Hey,

Is there any way to get 64-bit support for the Intel MPI libraries in 2017 for C++ applications? It used to be part of the 2016 package and was removed. I need some of the newer features of many of the other packages provided by Cluster XE, but the MPI, in my opinion, has gone through a downgrade.

Thanks,

Will
