Altera_Forum
Honored Contributor
8 years ago
That calculation seems correct to me. However, since you are only writing "temp.s0" to memory, it is possible that the computations for "temp.s1" through "temp.sF" are optimized out during synthesis because their results are never used. In that case, you are overestimating the number of operations by a factor of 16. Again, I recommend comparing the OpenCL compiler's area usage estimate with the final area utilization to see whether anything is being optimized out.
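To make the size of that possible overestimate concrete, here is a minimal sketch of the throughput arithmetic. The function name `gflops` and all numbers in the usage note are hypothetical and only illustrate the ratio; they are not taken from your kernel.

```c
/* Throughput in GFLOP/s for a kernel run.
 * ops_per_item: floating-point operations counted per work-item (assumed)
 * items:        number of work-items executed
 * seconds:      measured kernel run time                                  */
static double gflops(double ops_per_item, double items, double seconds)
{
    return ops_per_item * items / seconds / 1e9;
}
```

For example, with a hypothetical 100 operations per vector lane, 1e6 work-items and a 0.01 s run time, counting all 16 lanes gives `gflops(16 * 100.0, 1e6, 0.01)`, while counting only the "temp.s0" lane gives `gflops(100.0, 1e6, 0.01)`. The first figure is 16 times the second, so if the unused lanes really are removed, the reported throughput is inflated by that factor.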
Can you post a snippet of your host code where you are timing the kernel? Specifically, have you put a clFlush() or clFinish() after clEnqueueNDRangeKernel() and before reading the end time of the operation?
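For reference, a minimal sketch of the kind of timing I have in mind is below. It assumes a one-dimensional NDRange and an already created queue and kernel. The names `wall_seconds`, `time_kernel` and the `USE_OPENCL` build macro are made up for this example. Note that clFlush() only submits the queued commands and returns without waiting, whereas clFinish() blocks until they have completed, so for wall-clock timing clFinish() (or waiting on the kernel's event) is the safer choice.

```c
#include <stddef.h>
#include <time.h>

#ifdef USE_OPENCL              /* hypothetical macro: define it when building with an OpenCL SDK */
#include <CL/cl.h>
#endif

/* Wall-clock time in seconds, using the C11 timespec_get() call. */
static double wall_seconds(void)
{
    struct timespec ts;
    timespec_get(&ts, TIME_UTC);
    return (double)ts.tv_sec + (double)ts.tv_nsec * 1e-9;
}

#ifdef USE_OPENCL
/* Times one kernel launch; returns elapsed seconds, or -1.0 on error. */
static double time_kernel(cl_command_queue queue, cl_kernel kernel,
                          size_t global_size)
{
    /* Drain any earlier work so it is not counted in the measurement. */
    if (clFinish(queue) != CL_SUCCESS)
        return -1.0;

    double start = wall_seconds();

    /* Enqueueing only queues the kernel; it returns before the kernel runs. */
    if (clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global_size, NULL,
                               0, NULL, NULL) != CL_SUCCESS)
        return -1.0;

    /* Block until the kernel has actually finished before reading the clock. */
    if (clFinish(queue) != CL_SUCCESS)
        return -1.0;

    return wall_seconds() - start;
}
#endif
```

If the end time is read right after clEnqueueNDRangeKernel() without the second clFinish(), the measurement mostly covers the enqueue call itself and the run time will look far too short.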