Forum Discussion

Altera_Forum
Honored Contributor
8 years ago

Wrong results when running design on hardware

Hello,

My design is made of a chain of single work-item kernels transferring data using channels.

It runs fine in emulation, and the FPGA binary is built correctly (95% estimated resource usage).
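
For context, a minimal sketch of the kind of channel-connected single work-item chain described above (the kernel names, channel name, and depth are my own assumptions; depending on the SDK version the calls are `read/write_channel_intel` with `cl_intel_channels`, or the older `read/write_channel_altera` variants):

```
#pragma OPENCL EXTENSION cl_intel_channels : enable

// Hypothetical channel linking two single work-item kernels.
channel float stage_ch __attribute__((depth(64)));

__kernel void producer(__global const float *restrict in, const int n)
{
    for (int i = 0; i < n; ++i)
        write_channel_intel(stage_ch, in[i]);   // push one value per iteration
}

__kernel void consumer(__global float *restrict out, const int n)
{
    for (int i = 0; i < n; ++i)
        out[i] = read_channel_intel(stage_ch);  // pop in the same order
}
```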

Here is my problem:

Both emulation and hardware run up to completion (no deadlock), but only the emulation produces correct results.

The machines used for development and deployment are different, and it is not possible to use the same machine for both steps.

The only part that is recompiled on the deployment machine is the host binary, so I suspect that could be the issue, but I am not sure where to start looking for the cause of the problem.

Also, the host part processes the output from the FPGA only after the latter has finished. Could anything in the host compilation be affecting the results?

Did anyone experience a similar issue?

Any hints will be appreciated.

Leonardo

12 Replies

  • Altera_Forum

    Are you talking about num_compute_units for NDRange kernels or single work-item kernels? num_compute_units for NDRange kernels works in a fully automatic manner and does not require any user intervention other than adding the attribute to the kernel header. The compiler will automatically replicate the pipeline in this case, allowing multiple work-groups to be scheduled in parallel. This obviously comes at the cost of higher area usage and higher memory bandwidth utilization. If memory bandwidth is saturated, using num_compute_units will actually reduce performance due to extra memory contention.
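
    As an illustration of the attribute mentioned above, a hedged sketch of an NDRange kernel replicated into two compute units (the kernel name, arguments, and work-group size are hypothetical, not from this thread):

    ```
    // Hypothetical NDRange kernel; the compiler replicates the whole
    // pipeline twice, so two work-groups can execute concurrently.
    __attribute__((num_compute_units(2)))
    __attribute__((reqd_work_group_size(64, 1, 1)))
    __kernel void scale(__global const float *restrict in,
                        __global float *restrict out)
    {
        size_t gid = get_global_id(0);
        out[gid] = 2.0f * in[gid];  // each unit processes full work-groups
    }
    ```

    No host-side change is needed; work-group scheduling across the compute units is handled automatically.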

  • Altera_Forum

    --- Quote Start ---

    Are you talking about num_compute_units for NDRange kernels or single work-item kernels? num_compute_units for NDRange kernels works in a fully automatic manner and does not require any user intervention other than adding the attribute to the kernel header. The compiler will automatically replicate the pipeline in this case, allowing multiple work-groups to be scheduled in parallel. This obviously comes at the cost of higher area usage and higher memory bandwidth utilization. If memory bandwidth is saturated, using num_compute_units will actually reduce performance due to extra memory contention.

    --- Quote End ---

    Thanks for the reply.

    I meant going from single work-item to NDRange. The original kernel processes all pixels in one for loop; I tried splitting the loop in half and launching two copies in parallel. Using NDRange means calling get_global_id, and I later found out that this adds latency compared to not using it at all. I guess it's not a big deal when the kernels are complex and these 10 ms don't cause a bottleneck, but mimicking GPU programming on an FPGA just won't pay off...
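
    To make the two styles being compared concrete, a sketch under my own assumptions (kernel names, the inversion operation, and the pixel count argument are all hypothetical):

    ```
    // Single work-item style: one kernel instance, one pipelined loop.
    __kernel void process_swi(__global const uchar *restrict src,
                              __global uchar *restrict dst,
                              const int n_pixels)
    {
        for (int i = 0; i < n_pixels; ++i)
            dst[i] = 255 - src[i];   // example per-pixel operation
    }

    // NDRange style: one work-item per pixel, index via get_global_id().
    __kernel void process_ndr(__global const uchar *restrict src,
                              __global uchar *restrict dst)
    {
        size_t i = get_global_id(0);
        dst[i] = 255 - src[i];
    }
    ```

    In the NDRange version the global size is set on the host to the pixel count, so the loop disappears from the kernel body.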