Forum Discussion

PRavi7
New Contributor
6 years ago

Eviction policy of burst-coalesced cached non-aligned LSU's cache

Hi,

I am implementing an application using OpenCL targeting Intel Arria 10 GX 1150.

typedef struct {
    char data[8];
} block;

block buf;
while (true) {
    global_ptr = some_complex_address_calculation;
    #pragma unroll
    for (int i = 0; i < 8; i++) {
        buf.data[i] = global_ptr[i];
    }
}

I'm performing 8 consecutive reads, each of a struct that is 8 bytes wide, and the compiler converts this into a single 64-byte DDR read. Since all iterations of the unrolled loop perform consecutive accesses, the compiler implements a burst-coalesced LSU. The some_complex_address_calculation is such that the application should see pretty good spatial and temporal locality, yet I find that the default cache that comes with the burst-coalesced cached LSU isn't as efficient as expected, for reasons unknown to me.

I'd appreciate it very much if you could provide more information about the following, w.r.t. the burst-coalesced cached LSU:

1) What is the size of cache line?

2) What is the eviction policy?

3) Does this cache load the (n+1)-th block when a read request for the n-th block is issued? (Please note that a block here is 64 bytes, so when iteration n of the while loop is reading the n-th data block, can the cache pre-request the (n+1)-th block?)

Thanks in advance

3 Replies

  • KhaiChein_Y_Intel
    Regular Contributor

    Hi,

The cache size depends on the size of the memory region you would like to access. If you are using the built-in calls for loading from and storing to global memory, the cache size is defined by Argument #2 of the Load built-in. May I know which eviction policy you are referring to?

    Thanks.

  • HRZ
    Frequent Contributor

    The details of the cache are not documented anywhere; however, in my experience:

    1- The cache line size is equal to the size of the coalesced memory port. Moreover, by default the cache has 512 or 1024 lines (I don't remember exactly, since nowadays I always disable the cache to prevent it from wasting precious Block RAMs).

    2- It is probably something extremely simple like FIFO or LIFO; best case, LRU.

    3- I am pretty sure the cache doesn't pre-load anything.
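    For reference on HRZ's point about disabling the cache: in the Intel FPGA SDK for OpenCL, marking a kernel's global pointer argument as volatile tells the offline compiler not to generate the private cache for accesses through that pointer. A minimal sketch (the kernel name and argument shapes here are hypothetical, not from the original post):

```c
// volatile on the global pointer disables the private cache of the
// burst-coalesced cached LSU; accesses still burst-coalesce, but no
// Block RAM is spent on caching (hypothetical kernel shape):
__kernel void reader(__global volatile const char *restrict src,
                     __global char *restrict dst)
{
    /* ... accesses through src use an uncached burst-coalesced LSU ... */
}
```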

    In reality, exploiting your application's locality manually will always be more effective and efficient than relying on the extremely simple cache the compiler creates.
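    To illustrate the manual approach, here is a minimal sketch in plain C (not OpenCL) of a direct-mapped line buffer over 64-byte blocks, the kind of structure one would implement in on-chip local memory on the FPGA. All sizes and names are illustrative assumptions, not the compiler's actual cache implementation:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* 64-byte block, matching the 8x8-byte coalesced access in the question. */
#define BLOCK_BYTES 64
#define NUM_LINES   16  /* illustrative; on the FPGA, sized to fit Block RAM */

typedef struct {
    int      valid;
    uint64_t tag;                    /* block number held by this line */
    uint8_t  data[BLOCK_BYTES];
} cache_line;

static cache_line lines[NUM_LINES];  /* would live in local memory on-chip */
static long hits, misses;

/* A plain array standing in for global (DDR) memory. */
static uint8_t ddr[4096];

/* Direct-mapped lookup: return the cached 64-byte block containing addr,
 * refilling the line from "DDR" on a miss. */
static const uint8_t *read_block(uint64_t addr)
{
    uint64_t    block = addr / BLOCK_BYTES;
    cache_line *l     = &lines[block % NUM_LINES];

    if (!l->valid || l->tag != block) {      /* miss: refill from DDR */
        memcpy(l->data, &ddr[block * BLOCK_BYTES], BLOCK_BYTES);
        l->tag   = block;
        l->valid = 1;
        misses++;
    } else {                                 /* hit: serve from the line */
        hits++;
    }
    return l->data;
}
```

    Because the designer knows the application's access pattern, the replacement policy and line count can be chosen to match it exactly, which is why this tends to beat the compiler-generated cache.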