Global memory access 512-bit width constraint?

Question

Hi, I'm building a 2D FFT for image processing from the design example provided by Altera. I'm modifying it to take advantage of Hermitian symmetry, using an N/2-point FFT to perform an N-point real-to-complex transform. One problem that bothered me for a while is that although it only needs to compute an N/2-point FFT, it actually produces N/2+1 outputs, and that +1 is necessary for the inverse transform. So in the transpose kernel I have to somehow output one more value per row (or 8 values per work-group), and that extra output ruins the performance.

The kernel originally wrote 8 float2 (that is, 512 bits) to global memory. I added more sets of 8 float2 writes under different branches and that works fine. What really changes the structure is writing more than 8 float2 (either to the same cl_buffer or to a different buffer): the store unit is then constructed with a different width and much higher latency, 72 instead of 2 in my case, as you can see in the picture.

        dest.x = buf.x*A - buf.y*A + buf.x*B + buf.y*B;
        dest.y = buf.y*A + buf.x*A + buf.x*B - buf.y*B;
        dest.x = buf.x*A - buf.y*A + buf.x*B + buf.y*B;
        dest.y = buf.y*A + buf.x*A + buf.x*B - buf.y*B;
        dest.x = buf.x*A - buf.y*A + buf.x*B + buf.y*B;
        dest.y = buf.y*A + buf.x*A + buf.x*B - buf.y*B;
        dest.x = buf.x*A - buf.y*A + buf.x*B + buf.y*B;
        dest.y = buf.y*A + buf.x*A + buf.x*B - buf.y*B;
        dest.x = buf.x*A - buf.y*A + buf.x*B + buf.y*B;
        dest.y = buf.y*A + buf.x*A + buf.x*B - buf.y*B;
        dest.x = buf.x*A - buf.y*A + buf.x*B + buf.y*B;
        dest.y = buf.y*A + buf.x*A + buf.x*B - buf.y*B;
        dest.x = buf.x*A - buf.y*A + buf.x*B + buf.y*B;
        dest.y = buf.y*A + buf.x*A + buf.x*B - buf.y*B;
        dest.x = buf.x*A - buf.y*A + buf.x*B + buf.y*B;
        dest.y = buf.y*A + buf.x*A + buf.x*B - buf.y*B;
        //dest2.x = buf.x - buf.y; // these two lines make all the difference
        //dest2.y = 0;
        // or this one:
        //dest.x = buf.x*A - buf.y*A + buf.x*B + buf.y*B;
https://alteraforum.com/forum/attachment.php?attachmentid=14454&stc=1
https://alteraforum.com/forum/attachment.php?attachmentid=14455&stc=1

Eventually I worked around it by using channels to pass the extra data to a new kernel and letting that kernel write to global memory. I can't find anything about this 512-bit global memory access constraint or optimization in the documentation. Does anyone know why the compiler builds the store units this way? Thanks.
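For readers wondering where the N/2+1 figure in the question comes from: the DFT of a real sequence satisfies X[N-k] = conj(X[k]), so only bins 0 through N/2 need to be stored. Below is a minimal plain C sketch of that property. It is not the Altera design example and not OpenCL; the names cplx, real_dft8, hermitian_error, and unique_bins are hypothetical, and the length N = 8 is chosen only for illustration.

```c
#include <assert.h>

#define N 8  /* hypothetical transform length, for illustration only */

typedef struct { double x, y; } cplx;  /* mirrors OpenCL float2: x = real, y = imaginary */

static cplx cmul(cplx a, cplx b) {
    cplx r = { a.x * b.x - a.y * b.y, a.x * b.y + a.y * b.x };
    return r;
}

/* Naive O(N^2) DFT of a real sequence of length N = 8. */
static void real_dft8(const double *in, cplx *out) {
    /* w[m] = exp(-2*pi*i*m/8), built by repeated multiplication by w[1]. */
    const cplx w1 = { 0.70710678118654752, -0.70710678118654752 };
    cplx w[N];
    w[0].x = 1.0;
    w[0].y = 0.0;
    for (int m = 1; m < N; ++m) w[m] = cmul(w[m - 1], w1);

    for (int k = 0; k < N; ++k) {
        cplx acc = { 0.0, 0.0 };
        for (int t = 0; t < N; ++t) {
            cplx tw = w[(k * t) % N];
            acc.x += in[t] * tw.x;
            acc.y += in[t] * tw.y;
        }
        out[k] = acc;
    }
}

/* Largest deviation of out[N-k] from conj(out[k]) over k = 1..N-1. */
static double hermitian_error(const cplx *out) {
    double worst = 0.0;
    for (int k = 1; k < N; ++k) {
        double dx = out[N - k].x - out[k].x;  /* real parts should match */
        double dy = out[N - k].y + out[k].y;  /* imaginary parts should cancel */
        if (dx < 0) dx = -dx;
        if (dy < 0) dy = -dy;
        if (dx > worst) worst = dx;
        if (dy > worst) worst = dy;
    }
    return worst;
}

/* Number of bins that must be kept for a real input of even length n. */
static int unique_bins(int n) {
    assert(n > 0 && n % 2 == 0);
    return n / 2 + 1;
}
```

Calling real_dft8 on any 8 real samples and then hermitian_error on the result should give a value near zero, and unique_bins(8) is 5, which is the "one more output per row" the transpose kernel has to write.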

altera_forum · Answer

Images attached to posts in the forum seem to be automatically shrunk and compressed; it is impossible to see anything in your image. Can you post it somewhere else? Or better yet, attach the complete "report" folder?

Furthermore, I am not sure if I understand what your problem is; are you wondering why load/store units which are larger than 512 bits incur higher latency?

altera_forum · Answer

Hi HRZ,

I made the picture larger. This does not happen every time more than 512 bits are written to global memory; for example, it does not occur without the if/else branch. You can compile the code below and compare the system viewer output with and without the marked line commented out. Thank you.

__kernel void test(global float2 *restrict dest, global float2 *restrict in, int i)
{
    local float2 buf;
    int where = get_local_id(0);
    int N = 64;
    buf = in;
    buf = in;
    buf = in;
    buf = in;
    buf = in;
    buf = in;
    buf = in;
    buf = in;
    if (i) {
        dest = buf;
        dest = buf;
        dest = buf;
        dest = buf;
        dest = buf;
        dest = buf;
        dest = buf;
        dest = buf;
        dest = buf; // this line
    } else {
        dest = buf;
        dest = buf;
        dest = buf;
        dest = buf;
        dest = buf;
        dest = buf;
        dest = buf;
        dest = buf;
    }
}

altera_forum · Answer

I don't see anything out of the ordinary in the report from your sample code. The compiler creates a 512-bit coalesced load from global memory and two stores, one of which is 512 bits wide and the other 64 bits. Since the size of global memory ports must be a power of two, the compiler decides that your 9 consecutive stores are best split into one big and one small store, instead of a single 1024-bit store (which would waste a lot of memory bandwidth). This decision seems correct to me. Furthermore, the compiler is combining your stores from the if and the else branches, since the write addresses are the same and only the data is different; hence, the compiler can just instantiate a multiplexer to send the correct data to memory, instead of creating extra memory ports.
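The arithmetic behind that split can be sketched in plain C. This is only an illustrative model, not the compiler's actual algorithm: the 512-bit cap is an assumption taken from the widths seen in this thread, and next_pow2, split_ports, and MAX_PORT_BITS are hypothetical names.

```c
#include <assert.h>

#define MAX_PORT_BITS 512  /* assumed cap: the widest store reported in this thread */

/* Smallest power of two that is >= bits. */
static unsigned next_pow2(unsigned bits) {
    assert(bits > 0);
    unsigned p = 1;
    while (p < bits) p <<= 1;
    return p;
}

/* Greedily split total_bits into power-of-two ports no wider than
   MAX_PORT_BITS. Stores the widths in ports[] and returns the port count. */
static int split_ports(unsigned total_bits, unsigned *ports, int max_ports) {
    int count = 0;
    while (total_bits > 0 && count < max_ports) {
        unsigned w = MAX_PORT_BITS;
        while (w > total_bits) w >>= 1;  /* largest power of two that still fits */
        ports[count++] = w;
        total_bits -= w;
    }
    return count;
}
```

For 9 float2 values (9 * 64 = 576 bits), split_ports yields 512 + 64, whereas next_pow2(576) is 1024, so a single port would leave 448 of its bits unused on every access.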

Regarding latency, I am not seeing any specific difference. You are not comparing the latency of the "white" store unit, which belongs to your local buffer, with the "blue" store units of the global buffer, are you?

Finally, you should note that the actual latency of accesses to/from global memory is over 100 cycles. The latency the compiler reports for these accesses depends only on the number of extra registers the compiler inserts on the way to the memory port to absorb stalls, and does not reflect the real latency of the accesses. If an access finishes in fewer clock cycles than there are registers on its path, the pipeline will not stall (but some bubbles might be inserted). However, if the access takes longer, the pipeline will stall. At the end of the day, having more registers on the path of global memory accesses is beneficial since it allows more stalls to be absorbed, but it comes at the cost of higher area usage.

altera_forum · Answer

Hi HRZ

Thank you for the reply.

I'm compiling for Arria 10 and I did see a significant difference (2 vs. 50+2, as marked in the picture), and it is reflected in the run time on the FPGA (512*512 2D FFT: 0.8 ms vs. 2.5 ms).

https://imgur.com/a/7eqlr

It's what the compiler thinks is best for you, but in practice it's not optimal and the programmer has to deal with it... :o

altera_forum · Answer

The new image you have posted looks completely different from what I got from compiling your original kernel. In the new case, the reason for the slow-down is not the difference in latency, but rather the fact that now, instead of one read and one write port going to external memory, you have one read and 9 writes, all of which compete with each other for access to the memory bus. This results in a very high amount of contention and very frequent stalls in memory accesses, which get propagated all the way down the pipeline. If this is one of those cases where the compiler fails to coalesce the accesses even though they are consecutive, then, yes, the compiler is making a mistake here (I reported one such case to Altera long ago). If not, you should modify your kernel to minimize the number of write ports.

Unless your input is so small that the pipeline is not filled before execution finishes, the "latency" of the pipeline will not have a noticeable effect on run time.
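A rough way to see why pipeline latency barely matters is the usual cycle estimate for a stall-free pipelined loop: the pipeline fills once, then produces one result per initiation interval. The plain C sketch below assumes no stalls; the function name pipeline_cycles and the iteration count used in the note that follows are hypothetical, chosen to resemble the 512*512 case in this thread.

```c
#include <assert.h>

/* Cycle estimate for a stall-free pipelined loop:
   fill the pipeline once (latency), then one result every ii cycles. */
static unsigned long pipeline_cycles(unsigned long latency,
                                     unsigned long iterations,
                                     unsigned long ii) {
    assert(iterations > 0 && ii > 0);
    return latency + (iterations - 1) * ii;
}
```

Assuming 32768 iterations (512*512 points at 8 points per iteration) and an initiation interval of 1, a reported latency of 2 gives 32769 cycles and a latency of 52 gives 32819 cycles, a difference of about 0.15%. That is nowhere near the 0.8 ms vs. 2.5 ms gap measured above, which fits the explanation that stalls from port contention, not pipeline depth, dominate the run time.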
