About Altera OpenCL Compilation

Honored Contributor

11 years ago

--- Quote Start ---

Thank you! I am just wondering: if I call a "sub" function from my "main" kernel multiple times (i.e., the function call is inside a "for" loop), would it cause higher overhead or cause the compiler to use more hardware?

--- Quote End ---

The compiler inlines functions into hardware instead of "calling" them, so the answer depends on whether the loop gets automatically unrolled. Auto-unrolling can occur if the loop body is relatively small and the loop has a small, fixed trip count. If the loop is unrolled, the function you're calling is replicated once per unrolled iteration, which uses additional hardware. Similarly, if you call the function from several places within a kernel, each call site gets its own instance of the function in hardware. The compiler does this because replicating the hardware provides higher throughput when you have many work-items: different sets of work-items can use different instances of the function in parallel.
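As a rough illustration of the case described above, here is a minimal sketch of a single work-item kernel that calls a helper inside a small loop with a fixed trip count. The names `scale_add` and `apply4` are made up for this example, and the two `#define` shims are only there so the same text also builds as plain C on a host; whether the compiler actually unrolls this loop depends on the SDK version and its heuristics.

```c
/* Shims so this sketch also compiles as plain C on a host.
   An OpenCL compiler defines __OPENCL_VERSION__ and skips them. */
#ifndef __OPENCL_VERSION__
#define __kernel
#define __global
#endif

/* Hypothetical helper. On the FPGA it is inlined at each call site
   rather than being "called". */
float scale_add(float x, float a, float b)
{
    return a * x + b;
}

/* Hypothetical single work-item kernel. The loop body is small and the
   trip count is fixed at 4, so it is a candidate for auto-unrolling.
   If it is unrolled, expect four hardware copies of scale_add. */
__kernel void apply4(__global const float *in, __global float *out)
{
    for (int i = 0; i < 4; i++) {
        out[i] = scale_add(in[i], 2.0f, 1.0f);
    }
}
```

If the extra area matters more than throughput, the Altera documentation describes `#pragma unroll 1` placed directly before a loop as a way to ask the compiler not to unroll it; check the programming guide for your SDK version before relying on that.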

--- Quote Start ---

Will this way of programming (using one kernel to call sub-functions) in general cause higher hardware utilization or degrade performance compared to having multiple smaller kernels?

--- Quote End ---

Yes, it will most likely degrade the performance of all of the kernels. Local memory systems are optimized for the kernel that they connect to. When you fuse the functionality of multiple kernels, each of which likely has a different type of memory access pattern, you will end up with a complex local memory system that isn't optimized for any one of the kernels. I would expect to see more stalls/access conflicts on the memory, and a higher hardware cost.

The single fused kernel would also be much more complex, which may prevent some compiler optimizations (especially memory) that could be performed on each of the smaller, simpler kernels.

Your previous question asked how to effectively share a local memory across kernels. Memory shared across kernels is, by definition, global memory, so if at all possible I would suggest reworking your multiple kernels to either (1) share global memory instead of local memory, or (2) find a reasonable balance between local memory sizes such that each kernel can have its own local memory. If you care about throughput and performance, and not just about fitting multiple kernels that each require massive local memory onto the chip (at the cost of performance), then fusing kernels is probably not the best way to go.
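To make option (1) concrete, here is a minimal sketch of two small kernels that hand data to each other through a global buffer instead of a shared local memory. The kernel names `stage_a` and `stage_b`, the buffer `tmp`, and the arithmetic are all invented for illustration, and the sketch assumes the host enqueues `stage_b` only after `stage_a` has finished, passing the same buffer to both. The `#define` shims again exist only so the text also builds as plain C.

```c
/* Shims so this sketch also compiles as plain C on a host.
   An OpenCL compiler defines __OPENCL_VERSION__ and skips them. */
#ifndef __OPENCL_VERSION__
#define __kernel
#define __global
#endif

/* Hypothetical producer: squares each input and writes the
   intermediate result to a buffer in global memory. */
__kernel void stage_a(__global const int *in, __global int *tmp, int n)
{
    for (int i = 0; i < n; i++) {
        tmp[i] = in[i] * in[i];
    }
}

/* Hypothetical consumer: reads the intermediate buffer written by
   stage_a and produces a running sum. Each kernel stays small, so any
   local memory it uses can be tuned for its own access pattern. */
__kernel void stage_b(__global const int *tmp, __global int *out, int n)
{
    int sum = 0;
    for (int i = 0; i < n; i++) {
        sum += tmp[i];
        out[i] = sum;
    }
}
```

The trade-off is that the intermediate data makes a round trip through global memory, which is slower than local memory, but each kernel keeps a simple memory system of its own.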
