local memory bank

Question

I have read the Best Practices Guide, but I am still confused. I have optimized the local memory down to 1 read and 1 write. However, report.html says that the "w_local" memory uses 64 RAM blocks. I know the multiply is unrolled 64 times, so I need to fetch 64 values (64 * 16 = 1024 bits) in one clock. But since the local memory is optimized to 1 read, and each read fetches 1024 bits, shouldn't it use only 1 RAM block rather than 64?
typedef struct{
    short ff;
} filter_trans;
typedef struct{
    filter_trans ww;
} data_trans;
typedef struct{
    filter_trans ww;
} weight_trans;
__kernel(){
    weight_trans w_local;
    data_trans data_in = read_channel_intel(data_ch);
    cont control = read_channel_intel(cont_ch);
    weight_trans get_w = w_local;
    #pragma unroll
    for(int n=0; n<4; n++){
        winograd = 0;
        #pragma unroll
        for(int j=0; j<16; j++){
            winograd += get_w.ww.ff * data_in.ww.ff;
        }
    }
}

"w_local"
    Private memory: Optimal
    Requested size: 73728 bytes
    Implemented size: 131072 bytes
    Number of banks: 1
    Bank width: 1024 bits
    Bank depth: 1024 words
    Total replication: 1
    Additional information: Requested size 73728 bytes, implemented size 131072 bytes, stall-free, 1 read and 1 write.
    - See Best Practices Guide: Local Memory for more information.
    Private memory implemented in on-chip block RAM.

altera_forum · Answer

I also modified the code, but the RAM block usage became even worse: more than 64 blocks.
typedef struct{
    short ff;
} filter_trans;
typedef struct{
    filter_trans ww;
} data_trans;
typedef struct{
    filter_trans ww;
} weight_trans;
__kernel(){
    short __attribute__((numbanks(16),bankwidth(2))) w_local;
    data_trans data_in = read_channel_intel(data_ch);
    cont control = read_channel_intel(cont_ch);
    weight_trans get_w;
    #pragma unroll
    for(int n=0; n<4; n++){
        #pragma unroll
        for(int j=0; j<16; j++){
            get_w.ww.ff = w_local;
        }
    }

    #pragma unroll
    for(int n=0; n<4; n++){
        winograd = 0;
        #pragma unroll
        for(int j=0; j<16; j++){
            winograd += get_w.ww.ff * data_in.ww.ff;
        }
    }
}
Private memory: Optimal
    Total replication: 1
    Number of banks: 16 (banked on lowest dimension)
    Bank width: 16 bits
    Bank depth: 16384 words
    Requested size: 294912 bytes
    Implemented size: 524288 bytes
    Additional information: Requested size 294912 bytes, implemented size 524288 bytes, stall-free, 16 reads and 16 writes. Banked on lowest dimension into 16 separate banks. Reducing accesses to exactly one read and one write for all on-chip memory systems may increase overall system performance. See Best Practices Guide: Local Memory for more information.
    Private memory implemented in on-chip block RAM.
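
The sizes in both reports can be cross-checked by hand. The sketch below is only an illustration: it assumes the compiler rounds the depth of each bank up to the next power of two, which is consistent with the two reports in this thread but is not taken from the compiler documentation. The function names are hypothetical.

```c
/* Round n up to the next power of two (for n >= 1). */
unsigned long next_pow2(unsigned long n) {
    unsigned long p = 1;
    while (p < n) {
        p <<= 1;
    }
    return p;
}

/* Depth of one bank in words, after the ASSUMED power-of-two rounding. */
unsigned long bank_depth_words(unsigned long requested_bytes,
                               unsigned long banks,
                               unsigned long bank_width_bits) {
    unsigned long bits_per_row = banks * bank_width_bits;
    /* Ceiling division: words needed per bank to hold the requested data. */
    unsigned long words = (requested_bytes * 8 + bits_per_row - 1) / bits_per_row;
    return next_pow2(words);
}

/* Total implemented size in bytes across all banks. */
unsigned long implemented_bytes(unsigned long requested_bytes,
                                unsigned long banks,
                                unsigned long bank_width_bits) {
    unsigned long depth = bank_depth_words(requested_bytes, banks, bank_width_bits);
    return banks * depth * bank_width_bits / 8;
}
```

With the numbers from the first report, implemented_bytes(73728, 1, 1024) gives 131072 (576 words rounded up to 1024). With the banked version, implemented_bytes(294912, 16, 16) gives 524288 (9216 words per bank rounded up to 16384). The banked buffer therefore implements four times as many bits as the original one, so a higher RAM block count is to be expected.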

altera_forum · Answer

--- Quote Start ---

I know the multiply is unrolled 64 times, so I need to fetch 64 values (64 * 16 = 1024 bits) in one clock.

But since the local memory is optimized to 1 read, and each read fetches 1024 bits, shouldn't it use only 1 RAM block rather than 64?

--- Quote End ---

No. Even without taking replication into account, your buffer holds 576 * 1024 = 589824 bits. With Block RAMs of 20 Kb each, you need about 30 blocks just to fit the buffer. Furthermore, each Block RAM has two 32-bit ports, so you cannot read 1024 bits per clock from a single block. The write port has to be connected to every Block RAM used to implement the buffer, and the 1024-bit read port is split across them, which requires a minimum of 1024 / 32 = 32 Block RAMs to provide enough port width. After other overheads (address calculation, routing, etc.), the compiler ends up using 64 Block RAMs. This configuration is optimal and is unlikely to be improvable.
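
The two lower bounds in this answer can be written out as a small calculation. This is a rough sketch, not a model of what the compiler actually does. The block capacity (taken here as 20 * 1024 bits) and the 32-bit port width come from the answer above. The 512-word by 32-bit block geometry used in blocks_for_grid is an added assumption, and all function names are hypothetical.

```c
/* ASSUMED block RAM parameters, following the answer above. */
#define BLOCK_BITS       (20u * 1024u)  /* capacity of one block, in bits */
#define PORT_WIDTH_BITS  32u            /* data width of one port */
#define BLOCK_DEPTH      512u           /* ASSUMED words per block at 32-bit width */

/* Ceiling division for unsigned values. */
static unsigned ceil_div(unsigned a, unsigned b) {
    return (a + b - 1u) / b;
}

/* Blocks needed just to store depth_words x width_bits of data. */
unsigned blocks_for_capacity(unsigned depth_words, unsigned width_bits) {
    return ceil_div(depth_words * width_bits, BLOCK_BITS);
}

/* Blocks needed side by side so one access moves width_bits per clock. */
unsigned blocks_for_width(unsigned width_bits) {
    return ceil_div(width_bits, PORT_WIDTH_BITS);
}

/* Blocks in a grid: enough columns for the width, enough rows for the depth. */
unsigned blocks_for_grid(unsigned depth_words, unsigned width_bits) {
    return ceil_div(depth_words, BLOCK_DEPTH) * blocks_for_width(width_bits);
}
```

Under these assumptions, blocks_for_capacity(576, 1024) gives 29 (30 if a block is counted as 20,000 bits), blocks_for_width(1024) gives 32, and blocks_for_grid(576, 1024) gives 2 * 32 = 64. The last figure matches the reported block count, which suggests the depth of 576 words simply does not fit in a single row of 512-word blocks.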
