Forum Discussion

Altera_Forum
Honored Contributor
8 years ago

Wrong results when running design on hardware

Hello,

My design is made of a chain of single work-item kernels transferring data using channels.

It runs fine in emulation, and the FPGA binary is built correctly (95% estimated resource usage).
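
For context, a minimal sketch of the kind of channel-connected single work-item chain described above (the kernel names, channel name, and depth are my own assumptions; depending on the SDK version the calls are `read/write_channel_intel` with `cl_intel_channels`, or the older `read/write_channel_altera` variants):

```
#pragma OPENCL EXTENSION cl_intel_channels : enable

// Hypothetical channel linking two single work-item kernels.
channel float stage_ch __attribute__((depth(64)));

__kernel void producer(__global const float *restrict in, const int n)
{
    for (int i = 0; i < n; ++i)
        write_channel_intel(stage_ch, in[i]);   // push one value per iteration
}

__kernel void consumer(__global float *restrict out, const int n)
{
    for (int i = 0; i < n; ++i)
        out[i] = read_channel_intel(stage_ch);  // pop in the same order
}
```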

Here is my problem:

Both emulation and hardware run up to completion (no deadlock), but only the emulation produces correct results.

The machines used for development and deployment are different, and it is not possible to use the same machine for both steps.

The only part that is recompiled on the deployment machine is the host binary, so I suspect that could be the issue, but I am not sure where to start looking for the cause of the problem.

Also, the host part processes the output from the FPGA only after the latter has finished. Could anything in the host compilation be affecting the results?

Did anyone experience a similar issue?

Any hints will be appreciated.

Leonardo

12 Replies

  • Altera_Forum

    Are you talking about num_compute_units for NDRange kernels or single work-item kernels? num_compute_units for NDRange kernels works in a fully automatic manner and does not require any user intervention other than adding the attribute to the kernel header. The compiler will automatically replicate the pipeline in this case, allowing multiple work-groups to be scheduled in parallel. This obviously comes at the cost of higher area usage and higher memory bandwidth utilization. If memory bandwidth is saturated, using num_compute_units will actually reduce performance due to extra memory contention.
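
    As an illustration of the attribute mentioned above, a hedged sketch of an NDRange kernel replicated into two compute units (the kernel name, arguments, and work-group size are hypothetical, not from this thread):

    ```
    // Hypothetical NDRange kernel; the compiler replicates the whole
    // pipeline twice, so two work-groups can execute concurrently.
    __attribute__((num_compute_units(2)))
    __attribute__((reqd_work_group_size(64, 1, 1)))
    __kernel void scale(__global const float *restrict in,
                        __global float *restrict out)
    {
        size_t gid = get_global_id(0);
        out[gid] = 2.0f * in[gid];  // each unit processes full work-groups
    }
    ```

    No host-side change is needed; work-group scheduling across the compute units is handled automatically.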

  • Altera_Forum

    --- Quote Start ---

    Are you talking about num_compute_units for NDRange kernels or single work-item kernels? num_compute_units for NDRange kernels works in a fully automatic manner and does not require any user intervention other than adding the attribute to the kernel header. The compiler will automatically replicate the pipeline in this case, allowing multiple work-groups to be scheduled in parallel. This obviously comes at the cost of higher area usage and higher memory bandwidth utilization. If memory bandwidth is saturated, using num_compute_units will actually reduce performance due to extra memory contention.

    --- Quote End ---

    Thanks for the reply.

    I meant going from single work-item to NDRange. The original kernel processes all pixels in one for loop; I tried splitting the loop in half and launching two copies in parallel. Using NDRange means calling get_global_id, and I later found out that this adds latency compared to not using it at all. I guess it's not a big deal when the kernels are complex and these 10 ms don't cause a bottleneck, but mimicking GPU programming on an FPGA just won't pay off...
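
    To make the two styles being compared concrete, a sketch under my own assumptions (kernel names, the inversion operation, and the pixel count argument are all hypothetical):

    ```
    // Single work-item style: one kernel instance, one pipelined loop.
    __kernel void process_swi(__global const uchar *restrict src,
                              __global uchar *restrict dst,
                              const int n_pixels)
    {
        for (int i = 0; i < n_pixels; ++i)
            dst[i] = 255 - src[i];   // example per-pixel operation
    }

    // NDRange style: one work-item per pixel, index via get_global_id().
    __kernel void process_ndr(__global const uchar *restrict src,
                              __global uchar *restrict dst)
    {
        size_t i = get_global_id(0);
        dst[i] = 255 - src[i];
    }
    ```

    In the NDRange version the global size is set on the host to the pixel count, so the loop disappears from the kernel body.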