Forum Discussion
I have checked the other PEs and also the final stage, which receives the final output and writes it back to memory. They all suffer from the same stall.
I'm kinda afraid that I'm not capturing the PEs' counter values properly.
BTW, about the comments, that's how a software engineer survives FPGA programming :D
Then it sounds like the stalls are propagating up from the bottom of the pipeline. I am afraid I have never profiled autorun kernels, so I cannot comment on whether you are capturing the counters correctly. However, I find it very unlikely that regular compute PEs or on-chip channels would become a performance bottleneck. As a test, you can remove all your PEs from the design and keep only the memory read and memory write kernels, connected directly to each other through a channel. If you see similar stalling in this simplified design, the problem is coming from memory. Note that if you are exhausting the external memory bandwidth, seeing such stalls on the channels is completely normal.
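One rough way to judge whether external memory bandwidth is being exhausted is to compare the throughput the kernel actually achieved against the board's peak. The sketch below is only an illustration: the function name, the byte counts, the run time and the 19.2 GB/s peak are all made-up placeholders, so substitute the numbers from your own run and your board's datasheet.

```python
def bandwidth_utilization(bytes_read, bytes_written, elapsed_s, peak_gb_per_s):
    """Return achieved external-memory throughput as a fraction of the peak.

    All arguments are hypothetical inputs supplied by the user:
    bytes moved by the kernel, measured run time in seconds, and the
    theoretical peak bandwidth of the board in GB/s (1 GB = 1e9 bytes).
    """
    if elapsed_s <= 0 or peak_gb_per_s <= 0:
        raise ValueError("elapsed_s and peak_gb_per_s must be positive")
    achieved_gb_per_s = (bytes_read + bytes_written) / elapsed_s / 1e9
    return achieved_gb_per_s / peak_gb_per_s


# Placeholder example: 1 GiB read and 1 GiB written in 0.15 s on a board
# whose peak bandwidth is assumed to be 19.2 GB/s.
util = bandwidth_utilization(2**30, 2**30, 0.15, 19.2)
print(f"utilization: {util:.0%}")
```

If the ratio is already close to 1, the stalls seen on the channels are most likely just back-pressure from memory. How close counts as "close" is a judgment call, since the bandwidth achievable in practice is usually somewhat below the theoretical peak.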