Hello,
I have a simple loop, as follows:
__assume_aligned(mO, 32);
__assume_aligned(vO, 32);
__assume(numColsPad % 32 == 0);
#pragma omp parallel for
for (ii = 0; ii < numRows; ii++)
{
memcpy(&mO[ii * numColsPad], vO, numCols * sizeof(float));
}
Though I tell the compiler everything it needs to use the optimized memcpy, it complains that the destination should be aligned.
Is there a way to tell it that mO[ii * numColsPad] is aligned for any ii within any thread?
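
For reference, here is a minimal sketch of one approach that might be worth trying: restate the alignment on the per-row pointer inside the loop body, so the hint applies to the exact address passed to memcpy rather than only to the base pointer. The function name fill_rows and the local names row and src are hypothetical. The sketch uses the GCC/Clang builtin __builtin_assume_aligned as a stand-in; with the Intel compiler the analogous step would presumably be __assume_aligned(row, 32) on the local pointer. Whether this silences the diagnostic is not confirmed here.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: copies vO into the start of every row of mO.
   Assumes (as in the post) that mO and vO are 32-byte aligned and that
   numColsPad is a multiple of 32, so every row start is 32-byte aligned. */
void fill_rows(float *mO, const float *vO,
               int numRows, int numCols, int numColsPad)
{
    int ii;
    /* The pragma must be immediately followed by the for loop. */
#pragma omp parallel for
    for (ii = 0; ii < numRows; ii++)
    {
        /* Restate the alignment on the pointers actually handed to memcpy. */
        float *row = (float *)__builtin_assume_aligned(
            &mO[(size_t)ii * numColsPad], 32);
        const float *src = (const float *)__builtin_assume_aligned(vO, 32);

        memcpy(row, src, (size_t)numCols * sizeof(float));
    }
}
```

Calling fill_rows(mO, vO, numRows, numCols, numColsPad) would replace the loop above. The hint is only a promise to the compiler, so the buffers still have to be allocated with 32-byte alignment for it to be valid.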