Eviction policy of burst-coalesced cached non-aligned LSU's cache
Hi,
I am implementing an OpenCL application targeting the Intel Arria 10 GX 1150. The relevant part of the kernel looks roughly like this:

    typedef struct {
        char data[8];
    } block;

    block buf;
    while (true) {
        global_ptr = some_complex_address_calculation;
        #pragma unroll
        for (int i = 0; i < 8; i++) {
            buf.data[i] = global_ptr[i];
        }
    }

I'm performing 8 consecutive reads, each of which is a struct with 8 bytes. The compiler converts this to a 64-byte DDR read. Since all iterations of the unrolled loop perform consecutive accesses, the compiler implements a burst-coalesced LSU. The address produced by some_complex_address_calculation is such that the application should see pretty good data locality and temporal locality. However, I find that the default cache that comes with the burst-coalesced cached LSU isn't as efficient as I expected, for reasons unknown to me.
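
To make the access pattern concrete, here is a minimal plain C model of one iteration of the while loop. It is only a sketch under assumptions, not the actual kernel: it assumes the global pointer refers to elements of type block, so that the eight unrolled reads cover 8 x 8 = 64 contiguous bytes, and next_group_index is a hypothetical stand-in for some_complex_address_calculation.

```c
#include <assert.h>
#include <stddef.h>

#define BLOCKS_PER_READ 8   /* trip count of the unrolled loop */

typedef struct {
    char data[8];
} block;

/* Hypothetical stand-in for some_complex_address_calculation:
   returns the index of the first block of the next 64-byte group.
   The real calculation is not shown in the post. */
static size_t next_group_index(size_t iteration, size_t num_groups) {
    return ((iteration * 7u + 3u) % num_groups) * BLOCKS_PER_READ;
}

/* Copy one group of 8 consecutive blocks into a private buffer.
   In the kernel this corresponds to the fully unrolled loop, which
   the compiler reportedly merges into one wide DDR read. */
static void read_group(const block *global_mem, size_t num_blocks,
                       size_t first, block buf[BLOCKS_PER_READ]) {
    assert(first + BLOCKS_PER_READ <= num_blocks);
    for (int i = 0; i < BLOCKS_PER_READ; i++) {
        buf[i] = global_mem[first + i];
    }
}

/* Number of bytes touched by one group read. */
static size_t group_bytes(void) {
    return BLOCKS_PER_READ * sizeof(block);
}
```

In this model, two iterations whose next_group_index results coincide re-read the same 64 bytes, which is the kind of reuse the LSU cache would be expected to capture.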
I'd appreciate it very much if you could provide more information on the following regarding the burst-coalesced cached LSU:
1) What is the size of a cache line?
2) What is the eviction policy?
3) Does the cache load the (n+1)th block when a read request for the nth block is issued? (Please note that a block here is 64 bytes, so while iteration n of the while loop is reading the nth data block, can the cache pre-request the (n+1)th block?)
Thanks in advance