Forum Discussion
Just to elaborate on what I realize is an otherwise broad question...
If I were optimizing my example, I'd look to unroll the lcg48.step() loop and then parallelize the work items enough to use all the memory bandwidth. Each work-item "column" could then be pipelined. For example, with a 256-bit-wide memory and 32-bit work items, I'd have at most 8 pipelines (256/32), and each pipeline would process ceil(20/8) = 3 work items. (OK, my choice of example parameters isn't all that great, but I kept things small so I could see what's going on. You get the idea ;^))

So, back to the original question: I can use "#pragma unroll" to unroll the inner loop (and observe the unrolling in the report.html). What I don't see is how to control how many parallel pipelines are created (see note). I don't think it's happening automatically; or at least I don't see anything in the report.html to suggest that it is.
Hope that narrows things a little bit.
note: unrolling both loops in the k1a kernel does produce what looks like parallelism in the graph view of report.html.
- CFR, 5 years ago
Just to follow up in case someone else might be trying similar experiments...
I took the oneAPI/FPGA tutorial at FPL2020 (highly recommended). One of the things stressed in the tutorial was that the compiler handles "single_task" well but is not nearly as good with "parallel_for"; using "single_task" was strongly recommended. That didn't come across to me in the documentation.