I have downloaded a 2010 paper by Fritz Gerneth, "FIR Filter Algorithm Implementation Using Intel SSE Instructions", which targets the Atom.
On page 4, there is a brief description of the sum to be implemented and vectorized. The code is as follows:
for ( j = 0; j < 640; j++ ) {
    int s = 0;                   // s = accumulator
    for ( i = 0; i <= 63; i++ )
        s += c[i] * x[i + j];    // x[] = input values, c[] = filter coefficients
    y[j] = s;                    // y[] = output values
}
When j is on its last iteration (639) and i is on its second (1), the index in x[i + j] is already 640. The text says that 640 input elements are filtered, so the valid indices are 0 to 639 and x[640] is out of bounds. By the time i reaches its last iteration (63), the code reads x[63 + 639] = x[702], which is clearly broken.
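To make the arithmetic concrete, here is a minimal sketch of one way the loop could stay in bounds. I don't know whether this is what the paper intends: it may simply assume that x[] holds 640 + 63 = 703 elements, which is what the loop as printed requires. The names `NUM_TAPS`, `fir_valid_outputs` and `fir_valid` are mine, not the paper's. The sketch only computes the outputs whose full 64-element window lies inside the input, so 640 inputs give 640 - 64 + 1 = 577 outputs.

```c
#include <assert.h>
#include <stddef.h>

/* Filter length, taken from the inner loop bound (i = 0 .. 63) in the paper.
   The name NUM_TAPS is an assumption of this sketch. */
enum { NUM_TAPS = 64 };

/* Number of outputs that can be computed from n_in inputs without
   reading past x[n_in - 1]. */
size_t fir_valid_outputs(size_t n_in)
{
    return n_in >= NUM_TAPS ? n_in - NUM_TAPS + 1 : 0;
}

/* Scalar FIR sum restricted to windows that lie entirely inside x[].
   Writes at most y_cap outputs into y[] and returns the number written. */
size_t fir_valid(const int *c, const int *x, size_t n_in,
                 int *y, size_t y_cap)
{
    size_t n_out = fir_valid_outputs(n_in);
    assert(y_cap >= n_out);          /* caller must provide enough room */

    for (size_t j = 0; j < n_out; j++) {
        int s = 0;                   /* accumulator */
        for (size_t i = 0; i < NUM_TAPS; i++)
            s += c[i] * x[i + j];    /* largest index read is n_in - 1 */
        y[j] = s;
    }
    return n_out;
}
```

With this version, `fir_valid_outputs(640)` is 577, and producing all 640 outputs would need `n_in` to be 703. The other common choice is to keep 640 outputs and pad the input with 63 extra samples (zeros, or the start of the next block); which one is right depends on what the paper's surrounding code does with x[].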