Forum Discussion
Altera_Forum
Honored Contributor
8 years ago
Thanks for the reply.
So NDRange kernels are pipelined, but the pipelining is at the thread (work-item) level, not the loop-iteration level. Does that mean that with a group size of 64x64, all the work-items execute at the same time, but each one runs like a normal C program and does not get the pipelining that a single work-item kernel gets? Is there any way to get a hybrid of the two? And where can I check how many compute units I have?

When I increase the local group size from 1 to 8, performance scales up roughly in proportion. However, when I keep scaling from 8 to 16, performance stays close to the level at 8 and increases only a little.

Also, ReLU is a very simple SIMD function. In the NDRange kernel I use:

int i = get_global_id(0); if (input[i] < 0) { input[i] = 0; }

This is much slower (about 100x) than the single work-item version:

for (int i = 0; i < neuron; i++) { if (input[i] < 0) { input[i] = 0; } }

How did this happen? Is pipelining at the thread level slower than pipelining at the loop-iteration level? Does the thread-level version waste a lot of time stalling?
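To make the comparison concrete, the two ReLU variants would look roughly like the sketch below when written out in full. Several things here are assumptions rather than details from my actual code: the `float` element type, the kernel names `relu_ndrange` and `relu_single`, and the `input[i]` indexing. The `__kernel`/`__global` macros, the stand-in `get_global_id`, and the `emulate_ndrange` driver are hypothetical host-side shims that exist only so the snippet compiles as plain C. They say nothing about how the FPGA compiler schedules either kernel.

```c
#include <assert.h>
#include <stddef.h>

/* Host-side shims (assumptions, illustration only): they let the kernel
   source below compile as ordinary C. A real OpenCL compiler provides
   these qualifiers and get_global_id() itself. */
#define __kernel
#define __global
static size_t emulated_global_id = 0;
static size_t get_global_id(unsigned dim) { (void)dim; return emulated_global_id; }

/* NDRange form: each work-item handles exactly one element. */
__kernel void relu_ndrange(__global float *input)
{
    size_t i = get_global_id(0);
    if (input[i] < 0.0f) {
        input[i] = 0.0f;
    }
}

/* Single work-item form: one work-item loops over every element. */
__kernel void relu_single(__global float *input, int neuron)
{
    for (int i = 0; i < neuron; i++) {
        if (input[i] < 0.0f) {
            input[i] = 0.0f;
        }
    }
}

/* Hypothetical driver: runs the NDRange kernel once per work-item,
   one after another, purely to check the result on a host CPU. */
void emulate_ndrange(float *input, size_t global_size)
{
    assert(input != NULL);
    for (size_t gid = 0; gid < global_size; gid++) {
        emulated_global_id = gid;
        relu_ndrange(input);
    }
}
```

Both forms should produce the same output buffer; the question is only about why their throughput on the FPGA differs so much.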