Unexpected error when the FOR loop trip count differs
I have been trying to perform panel-by-panel matrix multiplication at the block level. The pseudocode is as follows:
__attribute__ ((reqd_work_group_size (1, 1, 1)))
__kernel void matmul_panel_fpga_cl (
    __global float * a,
    __global float * b,
    __global float * c,
    const int m,
    const int n,
    const int k,
    const int num_of_m_blocks,
    const int num_of_n_blocks
)
{
    // Loop index is named am so that it does not shadow the kernel argument a.
    for (int am = 0; am < num_of_m_blocks; am += M_STEP) { // num_of_m_blocks in panelA
        for (int it = 0; it < M_STEP; it++) {
            pack_a_matrix ();
        }
        for (int bb = 0; bb < num_of_n_blocks; bb++) { // num_of_n_blocks in panelB
            pack_b_matrix ();
            for (int ab = 0; ab < M_STEP; ab++) {
                pack_c_matrix ();
                packed_matrix_multiply (); // c = a * b on the packed blocks
                return_pack_c ();
            }
        }
    }
}
The kernel above works fine when the number of kernel invocations (in other words, the number of panels) is equal to num_of_n_blocks. When the two differ, it returns garbage values in packed_c. I call clFinish () every time, but I do not understand how num_of_n_blocks is related to the number of kernel invocations.
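To make the expected trip counts concrete, here is a minimal host-side sketch in plain C that mirrors the loop nest above and counts how many times each step would run in a single kernel invocation. The value M_STEP = 4, the call_counts struct, and the count_calls helper are assumptions for illustration only; they are not part of the actual kernel.

```c
#define M_STEP 4  /* assumed value; the real M_STEP is not shown in this post */

/* Hypothetical helper type: how often each step runs in one invocation. */
typedef struct {
    long pack_a;
    long pack_b;
    long pack_c;
} call_counts;

/* Mirrors the kernel's loop nest and counts calls instead of doing the work. */
call_counts count_calls(int num_of_m_blocks, int num_of_n_blocks)
{
    call_counts n = {0, 0, 0};

    for (int am = 0; am < num_of_m_blocks; am += M_STEP) {
        for (int it = 0; it < M_STEP; it++) {
            n.pack_a++;                 /* stands in for pack_a_matrix () */
        }
        for (int bb = 0; bb < num_of_n_blocks; bb++) {
            n.pack_b++;                 /* stands in for pack_b_matrix () */
            for (int ab = 0; ab < M_STEP; ab++) {
                n.pack_c++;             /* pack_c, multiply, return_pack_c */
            }
        }
    }
    return n;
}
```

With the assumed M_STEP of 4, count_calls (8, 3) gives 8 pack_a calls, 6 pack_b calls, and 24 pack_c calls per invocation. None of these counts depends on how many times the host launches the kernel, which is why the observed dependence is confusing.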
I have been using OpenCL FPGA SDK 20.3.
Please help us to understand.