Forum Discussion

Altera_Forum
8 years ago

How OpenCL synthesizes hardware on FPGA

Hi,

I have some doubts about how OpenCL synthesizes hardware on an FPGA. In the "vector_add" example (available at https://www.altera.com/support/support-resources/design-examples/design-software/opencl/vector-addition.html), how is the hardware realized in the FPGA, for both the single work-item and the NDRange kernel? In that example, the NDRange kernel executes one million sums, and I would like to know how that hardware is realized in the FPGA. Also, if I use the single work-item kernel instead of the NDRange kernel, how does the hardware change with respect to the NDRange case? Thanks for your help

Marco Montini

4 Replies

  • Altera_Forum

    I read all the topics you gave me, and I also read Section 1.3 of the Intel FPGA SDK for OpenCL Best Practices Guide, but I still have some doubts. If I am using a single work-item kernel to do a vector add, as in the example, I know that the loop iterations are pipelined, but how can I know what hardware is synthesized inside the FPGA? From the images in the Best Practices Guide, it seems there is just one adder, two registers for loading, and one for storing. Is that the real hardware created inside the FPGA? If so, then the data for the operations must be acquired by accessing DDR N times (for the global variables). Thanks

  • Altera_Forum

    In the specific case of vector-add, whether the kernel is NDRange or single work-item, the compiler will create one adder and three ports to global memory (two reads and one write), plus some buffers between global memory and the kernel to absorb possible stalls and some registers to allow pipelining. In this case, 2N values will be read from global memory, and N values will be written, with three values being read/written per clock. This will obviously result in poor performance; hence, SIMD (for NDRange kernels) and unrolling (for single work-item) can be used to increase the number of adders that are synthesized, and widen the ports to memory, to allow more data to be loaded and added per clock to improve performance.