Altera_Forum
8 years ago
1) Don't waste your time with Altera's emulator unless you want to use it for debugging functionality when using the channels extension. For debugging other cases, just run your code on a standard CPU/GPU. Apart from this, Altera's emulator is purely functional, and the run time you get under the emulator has nothing to do with the actual run time on the FPGA. A kernel getting faster or slower in the emulator doesn't mean it will get faster or slower on the actual FPGA either.
The errors you are getting are most likely caused by incorrect setup of the board; you should definitely update to the latest version of Nallatech's BSP, Quartus and AOC, and make sure you are using the same BSP for compilation as the one that is used on the machine with the FPGA. Nallatech also sometimes releases firmware updates for their boards, which must be applied. Finally, whoever is responsible for setting up the board must read Nallatech's documents and make sure all steps have been done correctly, and then test the board with "aocl diagnose" before running any actual kernels on it.
2) As mentioned above, timing results from the emulator mean nothing. Unfortunately, there is no way to get correct timing (or even an estimation of it) without placing and routing the kernel (if only I had a nickel for every time I have told Altera that providing a clock-accurate emulator should be at the top of their list of priorities). You can, however, use Altera's compilation report and area report (which have been significantly improved in v16.0 and 16.1) to get some idea of how to improve your kernels to achieve better performance; you must fully read Altera's Programming Guide and Best Practices Guide for OpenCL to understand how to interpret these reports.
3) Based on my experience, using single work-item is the correct approach in 80% of the cases. For cases where un-pipelinable loops exist in the kernel (e.g. nested loops with variable exit conditions), or for kernels where memory accesses are random or not consecutive, NDRange will probably work better. Determining which kernel type to use takes a lot of experience; there is no fixed formula for this. Using Altera's optimization techniques will definitely help, but probably not enough to get results comparable to a proper CPU or GPU; you will likely need to re-design your algorithm for the specific architecture of FPGAs to get comparable performance.
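To make the single work-item versus NDRange distinction in point 3 a bit more concrete, here is a rough sketch in plain C that can be built and run on an ordinary CPU for functional checking. It is not OpenCL kernel code and says nothing about FPGA timing. The function names (vec_add_single, vec_add_ndrange_item, launch_ndrange) and the vector-add workload are made up purely for illustration; in a real NDRange kernel the index would come from get_global_id(0) and the runtime would launch the work-items.

```c
#include <stddef.h>

/* Single work-item style: the kernel is invoked once and the loop lives
 * inside it. On the FPGA, this loop is what the compiler would try to
 * pipeline. */
void vec_add_single(const float *a, const float *b, float *c, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        c[i] = a[i] + b[i];
    }
}

/* NDRange style: the kernel body handles exactly one element.
 * `gid` stands in for what get_global_id(0) would return in OpenCL. */
void vec_add_ndrange_item(const float *a, const float *b, float *c,
                          size_t gid)
{
    c[gid] = a[gid] + b[gid];
}

/* Hypothetical stand-in for the OpenCL runtime launching n work-items.
 * It runs them one after another here, which is only a functional model. */
void launch_ndrange(const float *a, const float *b, float *c, size_t n)
{
    for (size_t gid = 0; gid < n; gid++) {
        vec_add_ndrange_item(a, b, c, gid);
    }
}
```

Calling vec_add_single and launch_ndrange on the same inputs should give identical outputs; the two forms differ only in where the iteration lives, which is what the choice between kernel types comes down to.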
Regarding the comparison with CPUs and GPUs, if you are comparing against a proper CPU and GPU (rather than extremely old or under-powered ones, which a lot of people unfortunately do), you can very likely expect better performance than the CPU, but not the GPU. It is very hard to beat modern GPUs with current FPGAs, due to the extremely low off-chip memory performance of the latter. Since you are doing academic work, you should consider reading the relevant related work; there are a lot of recent papers on OpenCL on FPGAs. Consider searching for "OpenCL Altera" or "OpenCL FPGA" in Google Scholar. This paper in particular has some examples of the performance difference between single work-item and NDRange kernels with different optimization levels, and also a comparison with a CPU and GPU of the same age as the Stratix V FPGA: http://dl.acm.org/citation.cfm?id=3014951