I have also modified the code, but the RAM block usage became even worse, now larger than 64.
typedef struct {
    short ff;
} filter_trans;
typedef struct {
    filter_trans ww;
} data_trans;
typedef struct {
    filter_trans ww;
} weight_trans;
__kernel(){
    short __attribute__((numbanks(16),bankwidth(2))) w_local;
    data_trans data_in = read_channel_intel(data_ch);
    cont control = read_channel_intel(cont_ch);
    weight_trans get_w;
    // load the weights from w_local into get_w
    #pragma unroll
    for(int n=0; n<4; n++){
        #pragma unroll
        for(int j=0; j<16; j++){
            get_w.ww.ff = w_local;
        }
    }
    // multiply-accumulate weights and input data
    #pragma unroll
    for(int n=0; n<4; n++){
        winograd = 0;
        #pragma unroll
        for(int j=0; j<16; j++){
            winograd += get_w.ww.ff * data_in.ww.ff;
        }
    }
}
Private memory: Optimal
Total replication: 1
Number of banks: 16 (banked on lowest dimension)
Bank depth: 16384 words
Additional information:
  Requested size 294912 bytes, implemented size 524288 bytes, stall-free, 16 reads and 16 writes.
  Banked on lowest dimension into 16 separate banks.
  Reducing accesses to exactly one read and one write for all on-chip memory systems may increase overall system performance.
  See Best Practices Guide : Local Memory for more information.
Implemented size: 524288 bytes
Bank width: 16 bits
Requested size: 294912 bytes
Private memory implemented in on-chip block RAM.
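
The gap between the requested and implemented sizes in the report seems to follow from the banking. 294912 bytes is 147456 16-bit words, or 9216 words per bank across 16 banks. If each bank's depth is rounded up to the next power of two, that gives the reported 16384 words, and 16 x 16384 x 2 bytes = 524288 bytes. The power-of-two rounding is an assumption inferred from these numbers, not something the report states, and the helper names below are hypothetical. A minimal sketch of the arithmetic in plain C:

```c
#include <assert.h>
#include <stddef.h>

/* Round n up to the next power of two (for n >= 1). */
static size_t next_pow2(size_t n) {
    size_t p = 1;
    while (p < n) {
        p <<= 1;
    }
    return p;
}

/*
 * Estimate the implemented size in bytes of a banked memory.
 * ASSUMPTION: the depth of each bank is rounded up to a power of two.
 * This matches the numbers in the report above but is only inferred.
 */
static size_t implemented_bytes(size_t requested_bytes,
                                size_t num_banks,
                                size_t bank_width_bytes) {
    assert(num_banks > 0 && bank_width_bytes > 0);
    /* Total words of bank_width_bytes each (rounded up). */
    size_t words = (requested_bytes + bank_width_bytes - 1) / bank_width_bytes;
    /* Words that each bank must hold (rounded up). */
    size_t depth = (words + num_banks - 1) / num_banks;
    return num_banks * next_pow2(depth) * bank_width_bytes;
}
```

Calling implemented_bytes(294912, 16, 2) reproduces the per-bank depth of 16384 words and the implemented size from the report, so under this assumption roughly 44% of the implemented memory is padding.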