Altera_Forum
Honored Contributor
8 years ago
Data is read before accumulation is finished
===============================================================
for(uint t = 0; t < loop_cnt; t++) {
    // load data to data buffer
    for(uint w = 0; w < TILE_WIDTH; w++) {
        data[w] = read_channel_altera(data_in_ch);
    }
    for(uint h = 0; h < TILE_HEIGHT; h++) {
        weight[h] = read_channel_altera(weight_in_ch);
    }

    // compute the matrix tile multiplication using the PE (MAC) array
    #pragma unroll
    for(uint w = 0; w < TILE_WIDTH; w++) {
        float data_temp = data[w];
        #pragma unroll
        for(uint h = 0; h < TILE_HEIGHT; h++) {
            float weight_temp = weight[h];
            float temp = data_temp * weight_temp;
            if(t == 0)
                output[h * TILE_WIDTH + w] = temp;
            else
                output[h * TILE_WIDTH + w] = output[h * TILE_WIDTH + w] + temp;
        }
    }
}

// declare output data to be enqueued in Altera channel
lane output_lane;
for(uint w = 0; w < TILE_WIDTH; w++) {
    #pragma unroll
    for(uint h = 0; h < TILE_HEIGHT; h++) {
        // multiply with scale and add bias before moving it out
        output_lane.lane_data[h] = output[h * TILE_WIDTH + w] * scale[h] + bias[h];
    }
    write_channel_altera(output_ch, output_lane);
}
========================================================================================
Here is a snippet of my code. I am doing a matrix multiplication and moving the data out through a channel once the accumulation is finished. But in the hardware run, the output is not fully accumulated: it is moved out before the accumulation is finished. For example, if the correct output pattern is all 36, the hardware run result is a mix of values smaller than 36.

The compilation report seems to support this. With TILE_WIDTH 4 and TILE_HEIGHT 8, the number of simultaneous reads to the output local buffer should be 32, but the report shows 40. I think the extra 8 are the simultaneous reads that move the data out after accumulation (32 + 8 = 40). So it looks like the accumulation and the moving out are happening at the same time! This is very weird, because moving out should happen only after the accumulation is finished.
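For checking the hardware result, the intended behavior (accumulate over all loop_cnt steps first, then apply scale and bias and move the data out) can be modeled on the host. This is only a sketch in plain C, not the kernel itself: the function name reference_tile is hypothetical, the channel reads are replaced by flat input arrays (an assumption), and the tile sizes are the 4 and 8 mentioned in the post.

```c
#include <stddef.h>

#define TILE_WIDTH  4
#define TILE_HEIGHT 8

/* Hypothetical host-side model. data holds loop_cnt * TILE_WIDTH values and
   weight holds loop_cnt * TILE_HEIGHT values, standing in for what the
   kernel would read from data_in_ch and weight_in_ch (assumption). */
void reference_tile(const float *data, const float *weight,
                    const float *scale, const float *bias,
                    unsigned loop_cnt,
                    float out[TILE_WIDTH][TILE_HEIGHT])
{
    /* Zero-initialised accumulator; equivalent to the kernel's
       "if (t == 0) assign, else add" whenever loop_cnt >= 1. */
    float acc[TILE_HEIGHT * TILE_WIDTH] = {0.0f};

    /* Phase 1: accumulate every partial product over all loop_cnt steps. */
    for (unsigned t = 0; t < loop_cnt; t++) {
        const float *d  = data   + (size_t)t * TILE_WIDTH;
        const float *wt = weight + (size_t)t * TILE_HEIGHT;
        for (unsigned w = 0; w < TILE_WIDTH; w++)
            for (unsigned h = 0; h < TILE_HEIGHT; h++)
                acc[h * TILE_WIDTH + w] += d[w] * wt[h];
    }

    /* Phase 2: only after accumulation, apply scale and bias per lane.
       out[w] corresponds to the w-th lane written to output_ch. */
    for (unsigned w = 0; w < TILE_WIDTH; w++)
        for (unsigned h = 0; h < TILE_HEIGHT; h++)
            out[w][h] = acc[h * TILE_WIDTH + w] * scale[h] + bias[h];
}
```

As one illustrative input (an assumption, since the post does not give the test data), all-ones data and weights with loop_cnt = 36, scale = 1 and bias = 0 give an expected output of all 36, so any smaller value in the hardware result points at a partially accumulated element.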
Below is the report for the local buffer output:
===========================================================================================
- Local memory: Optimal. Requested size 128 bytes (rounded up to nearest power of 2), implemented size 128 bytes, stall-free, 40 reads and 32 writes.
  Additional information:
  - Banked on lowest dimension into 32 separate banks (this is a good thing).
  - Reducing accesses to exactly one read and one write for all on-chip memory systems may increase overall system performance.
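The 32 + 8 = 40 arithmetic from the post can be written out explicitly. This is a minimal sketch in plain C with hypothetical function names. It assumes each textual read of output inside an unrolled loop body is replicated once per unrolled iteration. Whether the report's figure counts such access sites or accesses that are active in the same cycle is not established by the post.

```c
#define TILE_WIDTH  4
#define TILE_HEIGHT 8

/* Accumulation body: both the w and h loops are unrolled, and each (w, h)
   pair contains one read of output[] (in the else branch). */
unsigned accumulate_reads(void)
{
    return TILE_WIDTH * TILE_HEIGHT;
}

/* Move-out loop: only the inner h loop is unrolled, so the body holds
   one read of output[] per h. */
unsigned drain_reads(void)
{
    return TILE_HEIGHT;
}

/* Sum of both groups of reads, to compare with the "40 reads" in the report. */
unsigned total_reads(void)
{
    return accumulate_reads() + drain_reads();
}
```

With TILE_WIDTH 4 and TILE_HEIGHT 8 this gives 32 and 8, which together match the 40 reads in the report.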