Forum Discussion

PRavi7
New Contributor
6 years ago

Eviction policy of burst-coalesced cached non-aligned LSU's cache

Hi,

I am implementing an application using OpenCL targeting Intel Arria 10 GX 1150.

typedef struct {
    char data[8];
} block;

block buf;
while (true) {
    global_ptr = some_complex_address_calculation;
    #pragma unroll
    for (int i = 0; i < 8; i++) {
        buf.data[i] = global_ptr[i];
    }
}

I'm performing 8 consecutive reads, each of a struct that is 8 bytes wide, and the compiler converts this into a single 64-byte DDR read. Since all iterations of the unrolled loop perform consecutive accesses, the compiler implements a burst-coalesced LSU. The some_complex_address_calculation is such that the application should see pretty good spatial and temporal locality, yet I find that the default cache that comes with the burst-coalesced cached LSU isn't as efficient as expected, for reasons unknown to me.

I'd appreciate it very much if you could provide more information about the following, w.r.t. the burst-coalesced cached LSU:

1) What is the size of cache line?

2) What is the eviction policy?

3) Does this cache load the (n+1)-th block when a read request for the n-th block is issued? (Please note that a block here is 64 bytes, so when iteration n of the while loop is reading the n-th data block, can the cache pre-request the (n+1)-th block?)

Thanks in advance

3 Replies

  • KhaiChein_Y_Intel
    Regular Contributor

    Hi,

The cache size depends on the size of the memory region you would like to access. If you are using the built-in calls for loading from and storing to global memory, the cache size is defined by Argument #2 of the Load built-in. May I know which eviction policy you are referring to?

    Thanks.

  • HRZ
    Frequent Contributor

    The details of the cache are not documented anywhere; however, in my experience:

    1- The cache line size is equal to the size of the coalesced memory port. Moreover, by default the cache has 512 or 1024 lines (I don't remember exactly, since nowadays I always disable the cache to prevent it from wasting precious Block RAMs).

    2- It is probably something extremely simple like FIFO or LIFO; best case, LRU.

    3- I am pretty sure the cache doesn't pre-load anything.
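    For reference on HRZ's point about disabling the cache: in the Intel FPGA SDK for OpenCL, marking a kernel's global pointer argument as volatile tells the offline compiler not to generate the private cache for accesses through that pointer. A minimal sketch (the kernel name and argument shapes here are hypothetical, not from the original post):

```c
// volatile on the global pointer disables the private cache of the
// burst-coalesced cached LSU; accesses still burst-coalesce, but no
// Block RAM is spent on caching (hypothetical kernel shape):
__kernel void reader(__global volatile const char *restrict src,
                     __global char *restrict dst)
{
    /* ... accesses through src use an uncached burst-coalesced LSU ... */
}
```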

    In reality, exploiting your application's locality manually will always be more effective and efficient than relying on the extremely simple cache the compiler creates.
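    To illustrate the manual approach, here is a minimal sketch in plain C (not OpenCL) of a direct-mapped line buffer over 64-byte blocks, the kind of structure one would implement in on-chip local memory on the FPGA. All sizes and names are illustrative assumptions, not the compiler's actual cache implementation:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* 64-byte block, matching the 8x8-byte coalesced access in the question. */
#define BLOCK_BYTES 64
#define NUM_LINES   16  /* illustrative; on the FPGA, sized to fit Block RAM */

typedef struct {
    int      valid;
    uint64_t tag;                    /* block number held by this line */
    uint8_t  data[BLOCK_BYTES];
} cache_line;

static cache_line lines[NUM_LINES];  /* would live in local memory on-chip */
static long hits, misses;

/* A plain array standing in for global (DDR) memory. */
static uint8_t ddr[4096];

/* Direct-mapped lookup: return the cached 64-byte block containing addr,
 * refilling the line from "DDR" on a miss. */
static const uint8_t *read_block(uint64_t addr)
{
    uint64_t    block = addr / BLOCK_BYTES;
    cache_line *l     = &lines[block % NUM_LINES];

    if (!l->valid || l->tag != block) {      /* miss: refill from DDR */
        memcpy(l->data, &ddr[block * BLOCK_BYTES], BLOCK_BYTES);
        l->tag   = block;
        l->valid = 1;
        misses++;
    } else {                                 /* hit: serve from the line */
        hits++;
    }
    return l->data;
}
```

    Because the designer knows the application's access pattern, the replacement policy and line count can be chosen to match it exactly, which is why this tends to beat the compiler-generated cache.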