Forum Discussion

mvemp's avatar
mvemp
Icon for New Contributor rankNew Contributor
7 years ago

Does the local memory usage increase with banking and coalescing?

Does the number of M20K blocks increase when compiler automatically banks or increases the bank width?

Eg:-

int sample[4][32];

int a[4][32];

int b[4][32];

#pragma unroll

for(j = 0; j < 4; j++) {

#pragma unroll

for (i = 0; i < 32; i++) {

sample[j][i] = a[j][i] + b[j][i];

}

}

In one of my programs, which had a snippet similar to the one above, the M20K usage suddenly increased. The number of RAMs allocated is 103, which is quite strange. This happens even for the a and b variables.

The area report shows no reason why the number of RAM blocks increased.

Requested size 512 bytes

Implemented size 512 bytes

Private memory Optimal

Total replication 1

Number of banks 1

Bank width 4096 bits

Bank depth 1 word

Additional information Requested size 512 bytes, implemented size 512 bytes, stall-free, 1 read and 1 write.

Reference See Best Practices Guide : Local Memory for more information.

Is the increase in RAM blocks due to the type of memory access?

5 Replies

  • HRZ's avatar
    HRZ
    Icon for Frequent Contributor rankFrequent Contributor

    The number of M20K blocks used indeed depends on the number and width of accesses to the local buffer. However, all of this info will be reflected in the report. If the report says the implemented size is 512 bytes but 103 blocks are allocated, these blocks are likely being used in the other parts of the circuit. e.g. as buffers between the kernel and the memory interface, as on-chip cache for global memory accesses, as FIFOs in the pipeline or to implement channels, etc. If you check the "Area analysis by source" part of the area report, you can get a detailed break-down of where each resource is being used. If you post your full kernel code, I can generate the report and give you more details.

  • mvemp's avatar
    mvemp
    Icon for New Contributor rankNew Contributor
    // DIM_3: number of QTYPE elements packed per lane word.
    // DIM_4: number of parallel compute lanes.
    #define DIM_3 2
    #define DIM_4 8
     
    // Intel FPGA kernel-to-kernel channels (read_channel_intel / write_channel_intel).
    #pragma OPENCL EXTENSION cl_intel_channels : enable
     
     
    typedef char QTYPE;  // element type of data and weights
    typedef int  HTYPE;  // wide accumulator type for dot-product sums
     
     
    // One lane's worth of packed input elements.
    typedef struct {
       QTYPE data[DIM_3];
    } group_data;
     
     
    // DIM_4 lanes of packed elements -- the unit transferred per channel word.
    typedef struct {
       group_data lane[DIM_4];
    } group_vec; 
     
     
    // One output element per lane.
    // NOTE(review): lane[] is QTYPE (char) yet conv_wino uses a group_ch as its
    // running accumulator -- confirm that char-width accumulation (possible
    // overflow/truncation vs. HTYPE) is intentional.
    typedef struct {
       QTYPE lane[DIM_4];
    } group_ch;
     
     
     
     
    // depth(0): unbuffered channels -- producer and consumer proceed in lock-step.
    channel group_vec    data_ch    __attribute__((depth(0)));
    channel group_vec    weight_ch    __attribute__((depth(0)));
    channel group_ch   out_ch  __attribute__((depth(0)));
     
    // Streams the input feature map from global memory into data_ch.
    // Each group_data word read from "bottom" is broadcast across all
    // DIM_4 lanes so every compute lane receives the same element.
    __kernel
    __attribute__((task))
    __attribute__((max_global_work_dim(0)))
    void fetch_data(
    	__global group_data *restrict bottom
    	)
    {
    	const unsigned int total_words = (39 * 39 * 4096) / DIM_3;
    	for (unsigned int idx = 0; idx < total_words; idx++) {
    		group_data item = bottom[idx];
    		group_vec bcast;
    		// Replicate the fetched word into every lane.
    		#pragma unroll
    		for (unsigned char lane_id = 0; lane_id < DIM_4; lane_id++) {
    			bcast.lane[lane_id] = item;
    		}
    		write_channel_intel(data_ch, bcast);
    	}
    }
    	
     
    // Streams filter weights from global memory into weight_ch, one
    // group_vec per iteration, in the order the consumer kernel reads them.
    __kernel
    __attribute__((task))
    __attribute__((max_global_work_dim(0)))
    void fetch_weights(
    			__global volatile group_vec  *restrict weights 
    			   )
    {
    	const unsigned int total_words = (39 * 39 * 4096) / DIM_3;
    	for (unsigned int idx = 0; idx < total_words; idx++) {
    		write_channel_intel(weight_ch, weights[idx]);
    	}
    }
     
     
    // Dot-product accumulation kernel.
    // For each of the 39*39 output positions it reduces 4096/DIM_3
    // data/weight channel words into one per-lane sum, stores the sums
    // into a 169-deep circular buffer (conv_out), then applies the
    // inverse-Winograd-style combination and emits 4 result vectors
    // on out_ch.
    __kernel
    __attribute__((task))
    __attribute__((max_global_work_dim(0)))
    void conv_wino(
    			)
    {
    	group_vec data_vec;
    	group_vec weight_vec;
    	group_ch convout;
    	HTYPE conv_out[169][DIM_4];
    	group_ch inv_wino_out[4];
    	// BUG FIX: array_index was read before ever being written
    	// (undefined behavior); start the circular-buffer index at 0.
    	uint array_index = 0;
     
    	for(uint output = 0; output < 39 * 39; output++) {
    		// BUG FIX: convout.lane[] was never initialized, so the +=
    		// below accumulated garbage and also carried sums over from
    		// previous output positions. Clear the accumulators here.
    		#pragma unroll
    		for(uint k = 0; k < DIM_4; k++) {
    			convout.lane[k] = 0;
    		}
    		for(unsigned int win_itm_xyz = 0; win_itm_xyz < 4096/DIM_3; win_itm_xyz++){
    			data_vec = read_channel_intel(data_ch);
    			weight_vec = read_channel_intel(weight_ch);
     
    			// Multiply-accumulate across the DIM_3 packed elements of each lane.
    			#pragma unroll
    			for(uint i = 0; i < DIM_4; i++) {
    				#pragma unroll
    				for(uint j = 0; j < DIM_3; j++) {
    					convout.lane[i] += data_vec.lane[i].data[j] * weight_vec.lane[i].data[j];
    				}
    			}
    		}
    		// Store this output position's sums into the circular buffer.
    		#pragma unroll
    		for(unsigned char ll_t = 0; ll_t < DIM_4; ll_t++){
    			conv_out[array_index][ll_t] = convout.lane[ll_t];
    		}
    		if (array_index == 169 - 1){
    			array_index = 0;
    		}
    		else
    			array_index++;
    	}
     
    	// Inverse combination over the leading rows of conv_out.
    	#pragma unroll
    	for(unsigned char ll_t = 0; ll_t < DIM_4; ll_t++){
    		inv_wino_out[0].lane[ll_t] = conv_out[0][ll_t] + conv_out[1][ll_t] + conv_out[2][ll_t] + conv_out[1][ll_t] + conv_out[5][ll_t] + conv_out[9][ll_t] + conv_out[2][ll_t] + conv_out[6][ll_t] + conv_out[10][ll_t];
    		inv_wino_out[1].lane[ll_t] = conv_out[0][ll_t] + conv_out[5][ll_t] + conv_out[9][ll_t] - conv_out[2][ll_t] - conv_out[6][ll_t] - conv_out[10][ll_t] - conv_out[3][ll_t] - conv_out[7][ll_t] - conv_out[11][ll_t];
    		// NOTE(review): conv_out[157] is in bounds (array depth 169) but is
    		// inconsistent with the neighbouring small indices (4..16) -- it looks
    		// like a typo for conv_out[15]; confirm against the intended transform.
    		inv_wino_out[2].lane[ll_t] = conv_out[4][ll_t] + conv_out[9][ll_t] - conv_out[12][ll_t] + conv_out[5][ll_t] - conv_out[157][ll_t] - conv_out[13][ll_t] + conv_out[6][ll_t] - conv_out[10][ll_t] - conv_out[14][ll_t];
    		inv_wino_out[3].lane[ll_t] = conv_out[5][ll_t] - conv_out[16][ll_t] - conv_out[13][ll_t] - conv_out[6][ll_t] + conv_out[10][ll_t] + conv_out[14][ll_t] - conv_out[7][ll_t] + conv_out[11][ll_t] + conv_out[15][ll_t];
    	}
     
    	// Emit the 4 combined result vectors for the writeback kernel.
    	for(unsigned char ll_t = 0; ll_t < 4; ll_t++)
    	{
    		write_channel_intel(out_ch, inv_wino_out[ll_t]);
    	}
    }
     
     
     
    // Store Data to Global Memory.
    // Drains the 4 result vectors produced by conv_wino from out_ch and
    // writes them to consecutive elements of the output buffer "top".
    // (Removed unused locals: array_index, index_z_item, index_z_group.)
    __kernel
    __attribute__((task))
    __attribute__((max_global_work_dim(0)))
    void WriteBack(
                    __global group_ch *restrict top
    				)
    {
    	for(uint dd = 0; dd < 4; dd++){
    		top[dd] = read_channel_intel(out_ch);
    	}
    }
     
     
     
    	
     
     
    	
     
     
     
     

    In the code snippet, the local memory allocated for conv_out uses 103 RAM blocks. I checked the area analysis by source, but no line there accounts for the 103 RAM blocks.

  • HRZ's avatar
    HRZ
    Icon for Frequent Contributor rankFrequent Contributor

    I am surprised the report is not correctly reflecting the implemented size of the buffer. Anyway, based on the report, the bank width is 2048 bits, while the maximum width of the Block RAM ports is 40 bits. This means that a replication factor of at least 52 is required to provide enough ports to implement the buffer. Furthermore, each instance of the buffer is 8192 bytes which requires 3-4 Block RAMs (depending on the depth) to implement. Since the Block RAMs are double-pumped, the number of required Block RAMs will then be halved. I think 103 Block RAMs in the end is a reasonable number.

  • mvemp's avatar
    mvemp
    Icon for New Contributor rankNew Contributor

    Can I conclude that every bank gets mapped to 1 M20K, and that every 40 bits of bank width gets mapped to 1 M20K?

  • HRZ's avatar
    HRZ
    Icon for Frequent Contributor rankFrequent Contributor

    Actually, I am not sure whether the compiler configures the Block RAMs with a width of 32 bits or 40 bits. Either way, you should also take the size of the buffer into account (the size itself might require more Block RAMs than the minimum number needed to provide the necessary ports for all accesses). Moreover, the type of the accesses also matters: writes need to be connected to all buffer replicas, while reads need to be connected to only one. With double-pumping you effectively get 4 ports per Block RAM. E.g., for 5 reads and 1 write, you need 2 Block RAMs per every 32-bit (or 40-bit) slice of bank width, but for 5 reads and 2 writes you will need 3. To be honest, accurate prediction of Block RAM usage is not very straightforward.