Data is read before accumulation is finished

Question

===============================================================

for (uint t = 0; t < loop_cnt; t++) {
    // load data into the data and weight buffers
    for (uint w = 0; w < TILE_WIDTH; w++) {
        data[w] = read_channel_altera(data_in_ch);
    }
    for (uint h = 0; h < TILE_HEIGHT; h++) {
        weight[h] = read_channel_altera(weight_in_ch);
    }

    // compute the matrix tile multiplication using the PE (MAC) array
    #pragma unroll
    for (uint w = 0; w < TILE_WIDTH; w++) {
        float data_temp = data[w];
        #pragma unroll
        for (uint h = 0; h < TILE_HEIGHT; h++) {
            float weight_temp = weight[h];
            float temp = data_temp * weight_temp;
            if (t == 0)
                output[h * TILE_WIDTH + w] = temp;
            else
                output[h * TILE_WIDTH + w] = output[h * TILE_WIDTH + w] + temp;
        }
    }
}

// declare the output data to be enqueued in the Altera channel
lane output_lane;
for (uint w = 0; w < TILE_WIDTH; w++) {
    #pragma unroll
    for (uint h = 0; h < TILE_HEIGHT; h++) {
        // multiply by scale and add bias before moving it out
        output_lane.lane_data[h] = output[h * TILE_WIDTH + w] * scale[h] + bias[h];
    }
    write_channel_altera(output_ch, output_lane);
}

========================================================================================

Here is a snippet of my code. Basically, I am doing a matrix multiplication and moving the data out through a channel once the accumulation is finished. But according to the hardware run, the output is not fully accumulated: it is moved out before the accumulation is finished. For example, if the correct output pattern is all 36, the hardware run result is a mix of values smaller than 36. The compilation report seems to support this. With TILE_WIDTH 4 and TILE_HEIGHT 8, the number of simultaneous reads to the output local buffer should be 32, but the report shows 40, which is because after accumulation I have 8 simultaneous reads to move the data out (32 + 8 = 40). So it looks like the accumulation and the moving out are happening at the same time! This is very weird, because moving out should happen after accumulation is finished.

Below is the report for the local buffer "output":

===========================================================================================

Local memory: Optimal. Requested size 128 bytes (rounded up to nearest power of 2), implemented size 128 bytes, stall-free, 40 reads and 32 writes. Additional information: - Banked on lowest dimension into 32 separate banks (this is a good thing). - Reducing accesses to exactly one read and one write for all on-chip memory systems may increase overall system performance.

==========================================================================================

Any advice would be greatly appreciated!

altera_forum · Answer

The 40 reads that the compiler reports are correct and do not have the implication you think they do. Your kernel will turn into one single pipeline, and there will be 40 read ports going from the pipeline to the "output" buffer. It does not matter in which part of the code those reads are located; the compiler will make sure all reads can be satisfied in parallel to avoid any stalls in the pipeline. Note that all of these ports will be reported in the same part of the report; you should not expect to see 32 ports in one part and another 8 ports in another part of it.

Regarding outputs being sent to the channel before being fully accumulated: this is not possible. The compiler will ensure the loop on "t" fully finishes before the loop on the output channels starts. If you are seeing different output than what you expect, the problem is somewhere else. Have you tried to see what output you receive in the emulator and debug by printing the intermediate values?
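One way to act on the emulator suggestion is to compare the kernel's output against a plain C reference of the accumulation. The following is only a minimal sketch: it assumes the tile sizes from the question (TILE_WIDTH 4, TILE_HEIGHT 8), the function name reference_tile is hypothetical, flat input arrays stand in for the two input channels, and the scale and bias step is left out.

```c
#include <assert.h>
#include <stddef.h>

#define TILE_WIDTH  4   /* tile sizes taken from the question */
#define TILE_HEIGHT 8

/* Hypothetical host-side reference (not part of the original kernel).
 * data holds loop_cnt * TILE_WIDTH values and weight holds
 * loop_cnt * TILE_HEIGHT values, in the order the kernel would read them
 * from its input channels. */
static void reference_tile(const float *data, const float *weight,
                           unsigned loop_cnt,
                           float output[TILE_HEIGHT * TILE_WIDTH])
{
    assert(data != NULL && weight != NULL && output != NULL);

    /* Start from zero so that no element is read before it is written. */
    for (unsigned i = 0; i < TILE_HEIGHT * TILE_WIDTH; i++)
        output[i] = 0.0f;

    /* Same index layout as the kernel: output[h * TILE_WIDTH + w]. */
    for (unsigned t = 0; t < loop_cnt; t++) {
        for (unsigned w = 0; w < TILE_WIDTH; w++) {
            for (unsigned h = 0; h < TILE_HEIGHT; h++) {
                output[h * TILE_WIDTH + w] +=
                    data[t * TILE_WIDTH + w] * weight[t * TILE_HEIGHT + h];
            }
        }
    }
}
```

With every input set to 1.0f and loop_cnt set to 36, each element of the reference output is 36.0f, so a smaller value printed by the emulator at some value of "t" narrows down where an accumulation step was lost.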

A few tips:

1- Be very careful with using uint loop variables; if you compare "uint" with "int", you could get very different behavior compared to what you expect. Specifically for the loop on "t", "loop_cnt" must also be uint for correct behavior.

2- You might have to initialize the output buffer; depending on how the compiler actually unrolls the loops, the accumulation line might result in undefined behavior if the buffer is not initialized.

3- You should probably consider changing the order of your "h" and "w" loops, or transpose your "output" buffer, so that when you unroll the loops, the unrolled accesses to the buffer can be coalesced, resulting in larger but fewer read and write ports. With your current implementation, if you keep increasing the size of the output buffer to a point that it has to be implemented using Block RAMs, you will get a really large replication factor from the compiler to support all the non-coalesced read and write ports.
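The uint pitfall from tip 1 can be reproduced in plain C, whose conversion rules for mixed signed and unsigned comparisons OpenCL C is assumed to share here. This is an illustrative sketch only: both helper names and the "cap" parameter are hypothetical, and the cap exists only so that the demonstration terminates.

```c
#include <assert.h>

/* Hypothetical helper: counts iterations of an unsigned loop variable
 * compared against a signed bound. In "t < loop_cnt" the signed bound is
 * converted to unsigned, so a negative bound becomes a very large value. */
static unsigned count_iterations_uint(int loop_cnt, unsigned cap)
{
    assert(cap > 0u);
    unsigned n = 0;
    for (unsigned t = 0; t < loop_cnt && n < cap; t++)
        n++;
    return n;
}

/* Same loop with a signed loop variable, for contrast: a negative bound
 * gives zero iterations. */
static unsigned count_iterations_int(int loop_cnt, unsigned cap)
{
    assert(cap > 0u);
    unsigned n = 0;
    for (int t = 0; t < loop_cnt && n < cap; t++)
        n++;
    return n;
}
```

For a non-negative bound the two loops agree, but with loop_cnt equal to -1 the unsigned version keeps running until it hits the cap while the signed version never enters the loop body. Declaring "t" and "loop_cnt" with the same type avoids the mismatch.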

altera_forum · Answer

Hi HRZ,

Thanks for your kind reply!

I will follow your advice and make some changes! If you don't mind, would you like to see my source code and report so that you can see my problems better? Your help would be greatly appreciated!

Best regards,

Jiang Wenbo

altera_forum · Answer

I can take a look at your code, but I will be mostly unavailable in the following week and might not be able to help you much.

altera_forum · Answer

It's okay, I've been stuck here for more than one month. Do you have a personal email?

altera_forum · Answer

Since the board does not seem to allow private messages and I prefer not to post my email address directly on an open forum to avoid it being picked up by bots, please check this page (https://github.com/fpga-opencl-benchmarks/rodinia_fpga) for my email address. I am the second guy in the contact list (at the very bottom).

