for loop pipelined with NDRange

Honored Contributor

8 years ago

Thanks for the reply.

So NDRange kernels are pipelined, but the pipelining is at the thread (work-item) level, not the loop-iteration level. Does that mean the work-items of a 64x64 work-group execute at the same time, but each work-item runs like a normal C program, without the loop pipelining that a single work-item kernel gets? Is there any way to get a hybrid of the two?

Also, where can I check how many compute units I have? When I scale the local work-group size from 1 to 8, the performance scales with it, from 1x to 8x. However, when I keep scaling from 8 to 16, the performance stays at roughly 8x and increases only a little.

Also, take ReLU, which is a very simple SIMD function.

I use:

int i = get_global_id(0);
if (input[i] < 0) { input[i] = 0; }

This is also much slower (about 100x) than using a single work-item kernel, like:

for (int i = 0; i < neuron; i++) {
    if (input[i] < 0) { input[i] = 0; }
}

How come this happens? Is pipelining at the thread level slower than pipelining at the loop-iteration level? Does the thread-level version waste a lot of time stalling?
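To make the comparison concrete, the two ReLU variants can be sketched in plain C roughly as follows. This is only an illustration, not the actual kernels: the function names and the `data` buffer are hypothetical, `gid` stands in for `get_global_id(0)`, and the NDRange launch is emulated by an ordinary host loop, so the sketch shows only that both forms compute the same result and says nothing about their timing on the FPGA.

```c
#include <stddef.h>

/* NDRange style: the kernel body handles exactly one element.
 * `gid` is a stand-in for get_global_id(0) (assumption for this sketch). */
static void relu_ndrange_body(float *data, size_t gid)
{
    if (data[gid] < 0.0f) {
        data[gid] = 0.0f;
    }
}

/* Host-side emulation of launching `n` work-items, one per element.
 * On a real device the runtime issues the work-items; this loop only mimics that. */
static void relu_ndrange(float *data, size_t n)
{
    for (size_t gid = 0; gid < n; gid++) {
        relu_ndrange_body(data, gid);
    }
}

/* Single work-item style: one invocation loops over all neurons itself. */
static void relu_single(float *data, size_t neuron)
{
    for (size_t i = 0; i < neuron; i++) {
        if (data[i] < 0.0f) {
            data[i] = 0.0f;
        }
    }
}
```

Both functions clamp negative entries to zero in place; the only difference is whether the index comes from the launch (NDRange) or from a loop inside the kernel (single work-item).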
