You are performing random indirect accesses, so the behavior you are observing is expected. The FPGA memory controller is extremely inefficient for random accesses, and given the low external memory bandwidth (25-35 GB/s), you cannot get much scaling by increasing the number of threads either. Furthermore, the FPGA doesn't have a smart cache that can properly handle redundant random accesses. In contrast, on GPUs you get over 10 times the external memory bandwidth, a much more efficient memory controller, and two levels of smart caches.

I'm afraid there isn't much you can do to improve the performance of random indirect memory accesses. If you could at least make your accesses direct, it would probably help, but at the end of the day, if you want good memory performance on an FPGA, you need large, coalesced, and aligned memory accesses.
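To make the distinction concrete, here is a minimal sketch in plain C (not actual kernel code) of an indirect access pattern next to a direct one, plus a rough way to estimate how many separate memory bursts an index stream would touch. The function names and the 64-byte burst size are assumptions chosen for illustration; the real burst width depends on your board and memory interface.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed burst size in bytes; illustrative only, check your board's value. */
#define BURST_BYTES 64u

/* Indirect access: each read address depends on a value loaded from idx[],
 * so consecutive iterations can land anywhere in data[]. */
void gather_indirect(const float *data, size_t data_len,
                     const uint32_t *idx, float *out, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        assert(idx[i] < data_len); /* guard against out-of-range indices */
        out[i] = data[idx[i]];
    }
}

/* Direct access: the address is a simple function of the loop index,
 * so consecutive iterations read consecutive addresses. */
void copy_direct(const float *data, float *out, size_t n)
{
    for (size_t i = 0; i < n; i++)
        out[i] = data[i];
}

/* Rough proxy for memory traffic without a cache: count how often
 * consecutive float accesses fall into a different BURST_BYTES-sized block. */
size_t count_burst_switches(const uint32_t *idx, size_t n)
{
    size_t switches = 0;
    uint64_t prev = UINT64_MAX; /* sentinel: no burst seen yet */

    for (size_t i = 0; i < n; i++) {
        uint64_t burst = ((uint64_t)idx[i] * sizeof(float)) / BURST_BYTES;
        if (burst != prev) {
            switches++;
            prev = burst;
        }
    }
    return switches;
}
```

Under this simplified model, reading indices 0 to 31 in order touches only two assumed 64-byte bursts, while the four indices 0, 16, 32 and 48 already touch four. Each scattered 4-byte read then uses only 1/16 of the burst it pulls in, which is the kind of waste that large, coalesced accesses avoid.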