Forum Discussion
Altera_Forum
Honored Contributor
8 years ago
Actually, under normal circumstances the compilation fails during fitting in such cases, and you get an explicit message in stdout saying that Quartus failed to fit the design on the FPGA. In your case, though, fitting seems to finish successfully (with a slowed-down OpenCL clock) and the failure comes after that, which is pretty strange. Since AOC's area estimation can be wrong at times (especially on Arria 10), the "top.fit.summary" file will give you a much more accurate picture of area usage.
Your kernel is not necessarily inefficient performance-wise; it is inefficient from a coding-effort point of view, since you can easily use a for loop as I suggested and use "#pragma unroll" to fully or partially unroll it, depending on how much area you have available on the FPGA.
Regarding your questions:
1) It is hard to come up with a fixed formula for when single work-item works better and when NDRange does, but I would suggest using single work-item in cases where loop-carried dependencies exist and can be resolved by using temporary registers or shift registers to achieve an initiation interval (II) of one. For fully data-parallel kernels, you might as well use NDRange. In the general case, I would consider single work-item the first choice, and switch to NDRange only if I couldn't achieve good performance due to unresolvable dependencies or un-pipelinable loops.
2) Yes, that is what I was referring to by the term "thread".
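To illustrate the shift-register point in 1), here is a rough sketch in plain C (not OpenCL, so it can be compiled and checked on a host) of how a loop-carried dependency on an accumulator is usually broken up. The function names and SR_DEPTH are my own assumptions for illustration. In a real single work-item kernel the depth would be chosen to cover the latency of the floating-point adder, and the small fixed-trip loops would carry "#pragma unroll" so they become registers rather than loops.

```c
/* Hypothetical depth; on hardware it is sized to cover the adder latency. */
#define SR_DEPTH 8

/* Naive reduction: each iteration needs the sum produced by the previous
 * one, so the loop-carried dependency sits on a single accumulator. */
float sum_naive(const float *x, int n)
{
    float sum = 0.0f;
    for (int i = 0; i < n; i++)
        sum += x[i];
    return sum;
}

/* Shift-register reduction: each iteration reads the oldest partial sum
 * and writes a new one at the tail, so consecutive iterations no longer
 * depend on each other's result. */
float sum_shift_register(const float *x, int n)
{
    float sr[SR_DEPTH + 1];

    /* In an OpenCL kernel: #pragma unroll */
    for (int k = 0; k < SR_DEPTH + 1; k++)
        sr[k] = 0.0f;

    for (int i = 0; i < n; i++) {
        /* New partial sum goes to the tail of the register. */
        sr[SR_DEPTH] = sr[0] + x[i];

        /* In an OpenCL kernel: #pragma unroll */
        for (int k = 0; k < SR_DEPTH; k++)
            sr[k] = sr[k + 1];
    }

    /* Final reduction over the SR_DEPTH partial sums. */
    float sum = 0.0f;
    /* In an OpenCL kernel: #pragma unroll */
    for (int k = 0; k < SR_DEPTH; k++)
        sum += sr[k];
    return sum;
}
```

Both functions add up the same values, just in a different order, so with floating point the results can differ slightly in the last bits for general inputs. Whether the compiler actually reaches an II of one still has to be confirmed in the report for your own kernel.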