Hi everybody and thanks for your help!
I have this piece of code:
unsigned char *A, *B, *C;
// init A,B,C with mm_malloc, 64 bit aligned
for (j = 0; j < size; j++)
    C[j] = fminf(255, 255 - (A[j] * B[j]));
A, B and C hold 8-bit data, so with AVX vectorization I should get 16 operations per clock cycle. However, fminf works on 32-bit floats, so I only get 8 operations per clock cycle. I see that the Intel intrinsics include a min for the u8 datatype.
I tried to translate the loop into intrinsics, but I can't find load and multiply functions for the packed u8 datatype (epu8).
How can I obtain the maximum performance in this loop?
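For illustration, here is a minimal sketch of one way the loop could be written with SSE2 intrinsics. It rests on several assumptions that are not in the loop above. The intended result for a negative 255 - A[j]*B[j] is taken to be 0 (in the original, converting a negative float to unsigned char is out of range). The function names invmul_scalar and invmul_sse2 are hypothetical. Unaligned loads are used so the sketch works with any buffer. There is no packed 8-bit multiply intrinsic, so the bytes are widened to 16 bits, multiplied with _mm_mullo_epi16, and packed back. The integer loads (_mm_loadu_si128) do not depend on the element type.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>

/* Scalar reference. ASSUMPTION: negative results are clamped to 0. */
static void invmul_scalar(const unsigned char *A, const unsigned char *B,
                          unsigned char *C, size_t size)
{
    for (size_t j = 0; j < size; j++) {
        int v = 255 - A[j] * B[j];
        C[j] = (unsigned char)(v < 0 ? 0 : v);
    }
}

/* SSE2 version (hypothetical name): 16 bytes per iteration. */
static void invmul_sse2(const unsigned char *A, const unsigned char *B,
                        unsigned char *C, size_t size)
{
    const __m128i zero = _mm_setzero_si128();
    const __m128i c255 = _mm_set1_epi16(255);
    size_t j = 0;

    for (; j + 16 <= size; j += 16) {
        /* Integer loads are type-agnostic: 16 bytes each. */
        __m128i a = _mm_loadu_si128((const __m128i *)(A + j));
        __m128i b = _mm_loadu_si128((const __m128i *)(B + j));

        /* Zero-extend u8 -> u16 (elements 0..7 and 8..15). */
        __m128i a_lo = _mm_unpacklo_epi8(a, zero);
        __m128i a_hi = _mm_unpackhi_epi8(a, zero);
        __m128i b_lo = _mm_unpacklo_epi8(b, zero);
        __m128i b_hi = _mm_unpackhi_epi8(b, zero);

        /* 255 * 255 = 65025 fits in 16 bits, so the low half is exact. */
        __m128i p_lo = _mm_mullo_epi16(a_lo, b_lo);
        __m128i p_hi = _mm_mullo_epi16(a_hi, b_hi);

        /* Unsigned saturating subtract: max(0, 255 - product). */
        __m128i r_lo = _mm_subs_epu16(c255, p_lo);
        __m128i r_hi = _mm_subs_epu16(c255, p_hi);

        /* Results are in 0..255, so packing back to u8 is lossless. */
        _mm_storeu_si128((__m128i *)(C + j), _mm_packus_epi16(r_lo, r_hi));
    }

    /* Scalar tail for the remaining size % 16 elements. */
    for (; j < size; j++) {
        int v = 255 - A[j] * B[j];
        C[j] = (unsigned char)(v < 0 ? 0 : v);
    }
}
```

If the buffers are known to be 16-byte aligned, _mm_load_si128 and _mm_store_si128 could replace the unaligned variants. With AVX2, the same pattern should carry over to 256-bit registers, with the caveat that the 256-bit unpack and pack instructions operate within each 128-bit lane.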
Thanks
Best regards
Eric