Forum Discussion
Altera_Forum
Honored Contributor
8 years agoWell, this trade-off always exists that if your host to device transfer takes longer than your compute, you just compute on the host...
There is a note in Altera's documents that your OpenCL buffers must be 64-bit aligned to get full performance of DMA through PCI-E for host to device transfers, but I don't think this applies to the Cyclone SoCs since there is no PCI-E. On the other hand, I was under the impression that you can have shared memory between the ARM and the FPGA on these SoCs so that everything you malloc on the ARM is directly accessible to the FPGA. You can try passing the host pointer to the FPGA (CL_MEM_USE_HOST_PTR), instead of copying the data, and see what happens.