Hello,
I have a critical hotspot loop that I need to make sure gets vectorized, but the verbose vectorization report ("-vec-report5") shows that it is not being vectorized because of memory aliasing and a function call.
The most relevant part of the loop is included in the snippet below. A few remarks:
- The loop is unrolled for block sizes up to nb=16, but I include only the nb=2 version here because it is easier to read; the nb=16 version is very long.
- The matrix elements are in column-major format.
- The element access r(i, j) is an inline function, and r is defined as tmatrix& r = *this; the operator() definition is also included below.
- The only external function call is MKL's dlartgp, which generates a Givens rotation with higher precision and ensures non-negative diagonal elements; this is important when downdating a Cholesky factorization. Does this function call prevent vectorization? Are there any preprocessor directives (pragmas) to work around this?
- The matrix m_data is allocated page-aligned and is therefore aligned for both SSE and AVX. Do I also have to ensure that the first element accessed by the loop is AVX-aligned?
- The highest reuse is in the innermost loop (applying the Givens rotations to the trailing columns). This is where the blocking actually pays off thanks to better locality: with NB=16 there is a higher chance that the accessed elements (different rows of the same column) fall in the same cache line.
- In my blocked implementation I don't peel the loop, because handling the boundaries separately incurs many cache misses and poor performance (I have seen this with VTune). Instead, I pad the matrix to account for this triangularize operation and let the loop overrun into the padding (the overrun doesn't touch valid upper-triangular elements).
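To make the role of the rotation concrete, here is a minimal sketch of a Givens rotation with a non-negative result, using the same sign convention as my loop (u0 = c*x0 + s*x1, u1 = c*x1 - s*x0). This is only an illustration: make_givens and apply_givens are hypothetical names, and the code is a plain textbook rotation, not MKL's dlartgp and without its extra care for precision and scaling.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical stand-in for illustration only; this is NOT MKL's dlartgp.
struct Givens { double c, s, r; };

// Build a rotation such that  c*f + s*g = r >= 0  and  c*g - s*f = 0.
inline Givens make_givens(double f, double g) {
    assert(std::isfinite(f) && std::isfinite(g)); // precondition of this sketch
    const double r = std::hypot(f, g);            // non-negative by construction
    if (r == 0.0) return {1.0, 0.0, 0.0};         // nothing to rotate: identity
    return {f / r, g / r, r};
}

// Apply the rotation to one pair of elements from two adjacent rows,
// with the same formulas the trailing-column loop uses.
inline void apply_givens(const Givens& G, double& a, double& b) {
    const double ua = G.c * a + G.s * b;
    const double ub = G.c * b - G.s * a;
    a = ua;
    b = ub;
}
```

Applying the rotation to the pair it was built from gives (r, 0) with r >= 0, which is the property the downdate relies on; applying it to any other column is pure arithmetic on c and s, with no function call involved.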
Matrix type definition:
template <typename T>
class tmatrix
{
protected:
    // basic matrix structure (column-major storage)
    T* __restrict m_data;
    int m_rows, m_cols;

    inline T& elem(int i, int j) { return m_data[j*m_rows + i]; }
    inline const T& elem(int i, int j) const { return m_data[j*m_rows + i]; }

    inline T& operator()(int i, int j) { return elem(i, j); }
    inline const T& operator()(int i, int j) const { return elem(i, j); }
    // ...
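To make the alignment question concrete: with the indexing above and a 32-byte-aligned base pointer, element (i, j) sits at byte offset (j*m_rows + i)*sizeof(double), so a column start is 32-byte aligned only when m_rows is a multiple of 4 (assuming 8-byte doubles). Below is a minimal sketch of how this can be checked. The helper names (is_aligned, elem_byte_offset, padded_rows) are hypothetical and not part of my class.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// True if p is a multiple of `alignment` bytes (alignment must be a power of two).
inline bool is_aligned(const void* p, std::size_t alignment) {
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return (reinterpret_cast<std::uintptr_t>(p) & (alignment - 1)) == 0;
}

// Byte offset of element (i, j) from the base of a column-major array of doubles.
inline std::size_t elem_byte_offset(int i, int j, int rows) {
    return (static_cast<std::size_t>(j) * rows + i) * sizeof(double);
}

// Hypothetical helper: round the row count up so that every column
// starts on an `alignment`-byte boundary (4 doubles for 32-byte AVX).
inline int padded_rows(int rows, std::size_t alignment = 32) {
    const int per = static_cast<int>(alignment / sizeof(double));
    return (rows + per - 1) / per * per;
}
```

For example, with m_rows = 6 the second column starts 48 bytes after the base and is not 32-byte aligned, whereas padding to 8 rows puts it at 64 bytes. Note that even with aligned columns, the first row touched by the loop (im1) can still land at any offset within a column.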
Loop I would like to vectorize (outer loop starting at line 25):
tmatrix& r = *this;
const int m = r.rows();
const int n = r.cols();

double x00, x01, x10, x11, x21;
double xx0, xx1, xx2;

double u00, u01, u10, u11, u21;
double uu0, uu1, uu2;

double c0, c1, s0, s1, d;

int im1, ip1;
int jp1;

int nb = 2;
assert(nb == TRIANGULARIZE_NB);

int n_iter = (n - begin) % nb == 0 ? (n - begin) / nb : (n - begin) / nb + 1;
int nb_end = begin + n_iter*nb;
int i = begin - block + 2;
int j = begin;

// blocked step
for (int k = 0; k < n_iter; ++k, i += nb, j += nb)
{
    im1 = i - 1;
    ip1 = i + 1;
    jp1 = j + 1;

    // make the stencil as easy as possible
    x00 = r(im1, j);  x01 = r(im1, jp1);
    x10 = r(i,   j);  x11 = r(i,   jp1);
                      x21 = r(ip1, jp1);

    // make nb steps ahead using registers only
    dlartgp(&x00, &x10, &c0, &s0, &d);
    u00 = c0*x00 + s0*x10;
    u10 = c0*x10 - s0*x00;
    u01 = c0*x01 + s0*x11;
    u11 = c0*x11 - s0*x01;

    x11 = u11;
    dlartgp(&x11, &x21, &c1, &s1, &d);
    u11 = c1*x11 + s1*x21;
    u21 = c1*x21 - s1*x11;

    r(im1, j) = u00;  r(im1, jp1) = u01;
    r(i,   j) = u10;  r(i,   jp1) = u11;
                      r(ip1, jp1) = u21;

    #pragma omp parallel for schedule(runtime) private(xx0, xx1, xx2, uu0, uu1, uu2)
    for (int jj = (j + nb); jj < n; jj++)
    {
        xx0 = r(im1, jj);
        xx1 = r(i  , jj);
        xx2 = r(ip1, jj);

        uu0 = c0*xx0 + s0*xx1;
        uu1 = c0*xx1 - s0*xx0;

        xx1 = uu1;
        uu1 = c1*xx1 + s1*xx2;
        uu2 = c1*xx2 - s1*xx1;

        r(im1, jj) = uu0;
        r(i  , jj) = uu1;
        r(ip1, jj) = uu2;
    }
}

Thanks in advance,
Best regards,
Giovanni