Forum Discussion
Altera_Forum
Honored Contributor
8 years ago

Thanks again for your help.
I have now executed one of my kernels on an actual FPGA. However, I see neither a speed-up nor a slow-down when I change the kernel from NDRange to single work-item (which I was not expecting, at least for this simple kernel). The kernel I am executing on the FPGA contains the following:

```c
uint2 pixel = (uint2)(get_global_id(0), get_global_id(1));
depth = ...
```

To change it to single work-item, I rewrote it this way (also replacing clEnqueueNDRangeKernel with clEnqueueTask in the host code):

```c
for (uint pixel_y = 0; pixel_y < 240; pixel_y++) {
    for (uint pixel_x = 0; pixel_x < 320; pixel_x++) {
        depth = ...
    }
}
```

Is this the correct way of converting to a single-work-item kernel (I did not find a proper method documented anywhere)? If not, how should I do it, and what would you suggest to improve the execution time of this kernel?

How about a more complex kernel in my application, like the one below? Should I convert it to single work-item (as above) and follow the optimization report, or follow the guide on how to improve NDRange kernels?

```c
const uint2 pos  = (uint2)(get_global_id(0), get_global_id(1));
const uint2 size = (uint2)(get_global_size(0), get_global_size(1));
const float center = in[pos.x + size.x * pos.y];

if (center == 0) {
    out[pos.x + size.x * pos.y] = 0;
    return;
}

for (int i = -r; i <= r; ++i) {
    for (int j = -r; j <= r; ++j) {
        // Clamp in signed arithmetic: with clamp(pos.x + i, 0u, size.x - 1),
        // a negative offset at the left border wraps to a huge uint and
        // clamps to the right border instead of 0.
        const uint2 curPos = (uint2)(clamp((int)pos.x + i, 0, (int)size.x - 1),
                                     clamp((int)pos.y + j, 0, (int)size.y - 1));
        const float curPix = in[curPos.x + curPos.y * size.x];
        if (curPix > 0) {
            sum += factor;
        }
    }
}
out[pos.x + size.x * pos.y] = t / sum;
```

The reason I am asking is that, for the moment, I have to stick with the current old version of AOCL. I want to get a feel for what to think about and follow the correct optimization path from the start, rather than having to wait a few hours for each method and approach to compile before I can see the timing results.

Thank you very much.