Yes, I still get incorrect output, and I am not using the ivdep pragma. After a debugging session, I narrowed the problem down to the following code:
for (uchar j=0; j<8; j++) {
    k = j*16;
    p = j*16+128;
    h = j*16+256;
    t = j*16+384;
    for (uchar i=0; i<16; i+=2) {
        result2[k+i] = (((result[p+i+1]*tw256[k+i+1] + result[p+i]*tw256[k+i] + result[h+i+1]*tw256[p+i+1] + result[h+i]*tw256[p+i] + result[t+i+1]*tw256[h+i+1] + result[t+i]*tw256[h+i]) >> 15) + result[k+i])>>1;
        result2[k+i+1] = (((result[p+i+1]*tw256[k+i] - result[p+i]*tw256[k+i+1] + result[h+i+1]*tw256[p+i] - result[h+i]*tw256[p+i+1] + result[t+i+1]*tw256[h+i] - result[t+i]*tw256[h+i+1]) >> 15) + result[k+i+1])>>1;
        result2[p+i] = (((result[t+i+1]*tw256[h+i] - result[t+i]*tw256[h+i+1] - result[h+i+1]*tw256[p+i+1] - result[h+i]*tw256[p+i] - result[p+i+1]*tw256[k+i] + result[p+i]*tw256[k+i+1]) >> 15) + result[k+i])>>1;
        result2[p+i+1] = (((result[p+i+1]*tw256[k+i+1] + result[p+i]*tw256[k+i] - result[h+i+1]*tw256[p+i] + result[h+i]*tw256[p+i+1] - result[t+i+1]*tw256[h+i+1] - result[t+i]*tw256[h+i]) >> 15) + result[k+i+1])>>1;
        result2[h+i] = (((result[h+i+1]*tw256[p+i+1] + result[h+i]*tw256[p+i] - result[t+i+1]*tw256[h+i+1] - result[t+i]*tw256[h+i] - result[p+i+1]*tw256[k+i+1] - result[p+i]*tw256[k+i]) >> 15) + result[k+i])>>1;
        result2[h+i+1] = (((result[h+i+1]*tw256[p+i] - result[h+i]*tw256[p+i+1] - result[t+i+1]*tw256[h+i] + result[t+i]*tw256[h+i+1] - result[p+i+1]*tw256[k+i] + result[p+i]*tw256[k+i+1]) >> 15) + result[k+i+1])>>1;
        result2[t+i] = (((result[p+i+1]*tw256[k+i] - result[p+i]*tw256[k+i+1] - result[h+i+1]*tw256[p+i+1] - result[h+i]*tw256[p+i] - result[t+i+1]*tw256[h+i] + result[t+i]*tw256[h+i+1]) >> 15) + result[k+i])>>1;
        result2[t+i+1] = (((result[t+i+1]*tw256[h+i+1] + result[t+i]*tw256[h+i] - result[h+i+1]*tw256[p+i] + result[h+i]*tw256[p+i+1] - result[p+i+1]*tw256[k+i+1] - result[p+i]*tw256[k+i]) >> 15) + result[k+i+1])>>1;
    }
}
The buffer "result" is correct; it is calculated earlier in the code. The buffer "tw256" is constant and read-only. The variables k, p, h, and t are initialized. The buffer "result2" is the one that comes out incorrect. Both "result" and "result2" are declared as registers (512 registers of width 16 and depth 1). I also tried splitting the writes to "result2" across several unrolled loops. The output then has fewer errors, but it is still wrong.
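Two properties of the loop above can be checked on the host, independently of the compiler. The following is a minimal C sketch, not the kernel itself. It assumes that "result" and "tw256" hold signed 16-bit values and that the six products are summed in a 32-bit int before the shift; the post does not state either. The function names are hypothetical and exist only for this illustration.

```c
#include <stdint.h>

/* Largest index the loop nest touches: t + i + 1 at j = 7, i = 14. */
static int max_index(void) {
    int max = 0;
    for (int j = 0; j < 8; j++) {
        int t = j * 16 + 384;              /* highest of the four bases */
        for (int i = 0; i < 16; i += 2) {
            if (t + i + 1 > max) {
                max = t + i + 1;
            }
        }
    }
    return max;
}

/* Value a base offset would take IF it were stored in an 8-bit variable.
 * ASSUMPTION: hypothetical case; the types of k, p, h, t are not given. */
static uint8_t narrow_base(int j, int offset) {
    return (uint8_t)(j * 16 + offset);
}

/* Worst-case magnitude of six 16-bit x 16-bit products, computed in 64 bits.
 * ASSUMPTION: operands are signed 16-bit. */
static int64_t worst_case_sum(void) {
    int64_t prod = (int64_t)INT16_MIN * (int64_t)INT16_MIN; /* 2^30 */
    return 6 * prod;
}

/* Nonzero if v is representable in a signed 32-bit integer. */
static int fits_in_int32(int64_t v) {
    return v >= INT32_MIN && v <= INT32_MAX;
}
```

The largest index is 511, so all accesses stay inside the 512 registers, but the bases h and t do not fit in 8 bits: if k, p, h, and t were declared as uchar like the loop counters, h would wrap onto k and t onto p. Likewise, six full-scale products do not fit in 32 bits; whether the real data can get close to that bound depends on the value range of "result" and "tw256".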
Could you please let me know if you see anything wrong? I can share the complete code if that helps.
Thank you for your support.