Altera_Forum
Honored Contributor
8 years ago
Global memory access 512-bit width constraint?
Hi
I'm building a 2D FFT for image processing based on the design example provided by Altera. My modification takes advantage of Hermitian symmetry: it uses an N/2-point FFT to perform an N-point real-to-complex transform. One problem has bothered me for a while: although only an N/2-point FFT is needed, the transform actually produces N/2+1 outputs, and that extra output is necessary for the inverse transform. So in the transpose kernel I have to output one more value per row (or 8 values per work-group), and that extra output ruins the performance.

The kernel originally wrote 8 float2 values (that is, 512 bits) to global memory. I added more sets of 8 float2 writes under different branches, and that works fine. What really changes the structure is writing more than 8 float2 values (either to the same cl_buffer or to a different buffer): the store unit is then built with a different width and much higher latency, 72 instead of 2 in my case, as you can see in the picture.
dest.x = buf.x*A - buf.y*A + buf.x*B + buf.y*B;
dest.y = buf.y*A + buf.x*A + buf.x*B - buf.y*B;
dest.x = buf.x*A - buf.y*A + buf.x*B + buf.y*B;
dest.y = buf.y*A + buf.x*A + buf.x*B - buf.y*B;
dest.x = buf.x*A - buf.y*A + buf.x*B + buf.y*B;
dest.y = buf.y*A + buf.x*A + buf.x*B - buf.y*B;
dest.x = buf.x*A - buf.y*A + buf.x*B + buf.y*B;
dest.y = buf.y*A + buf.x*A + buf.x*B - buf.y*B;
dest.x = buf.x*A - buf.y*A + buf.x*B + buf.y*B;
dest.y = buf.y*A + buf.x*A + buf.x*B - buf.y*B;
dest.x = buf.x*A - buf.y*A + buf.x*B + buf.y*B;
dest.y = buf.y*A + buf.x*A + buf.x*B - buf.y*B;
dest.x = buf.x*A - buf.y*A + buf.x*B + buf.y*B;
dest.y = buf.y*A + buf.x*A + buf.x*B - buf.y*B;
dest.x = buf.x*A - buf.y*A + buf.x*B + buf.y*B;
dest.y = buf.y*A + buf.x*A + buf.x*B - buf.y*B;
//dest2.x = buf.x - buf.y; // these two lines make all the difference
//dest2.y = 0;
// or this one:
//dest.x = buf.x*A - buf.y*A + buf.x*B + buf.y*B;
https://alteraforum.com/forum/attachment.php?attachmentid=14454&stc=1
https://alteraforum.com/forum/attachment.php?attachmentid=14455&stc=1
Eventually I worked around it by using channels to pass the extra data to a new kernel, which writes it to global memory. I can't find anything about this 512-bit global memory access constraint or optimization in the documentation. Does anyone know why the compiler builds the store units this way? Thanks.