Altera_Forum
Honored Contributor
8 years ago
I see. Based on what you describe, a linear extrapolation suggests that three kernel copies should also fit, but that does not seem to be the case: both logic and RAM appear to be overutilized. Assuming you can decouple memory accesses from compute in your application and put them in separate kernels connected via channels, I recommend converting the compute part to an autorun kernel and then replicating it using the num_compute_units attribute (which behaves differently here than when it is used with NDRange kernels). In my experience, replicating single work-item autorun kernels with num_compute_units incurs very little replication overhead.
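To illustrate the structure described above, here is a minimal sketch, not your actual design. It assumes a recent Intel FPGA SDK for OpenCL where the channel extension is cl_intel_channels (older Altera releases used cl_altera_channels and read_channel_altera / write_channel_altera instead). The kernel names, channel names, NUM_CU value, channel depth, and the placeholder computation are all hypothetical; substitute your own.

```c
// OpenCL C for the Intel FPGA SDK for OpenCL (compile with aoc, not a host C compiler).
#pragma OPENCL EXTENSION cl_intel_channels : enable

#define NUM_CU 2  // hypothetical replication factor; pick what fits on your device

// One input and one output channel per compute replica (names are illustrative).
channel float in_ch[NUM_CU]  __attribute__((depth(16)));
channel float out_ch[NUM_CU] __attribute__((depth(16)));

// Memory-read kernel: the only kernel that loads from global memory.
// n is the number of items sent to EACH replica.
__attribute__((max_global_work_dim(0)))
__kernel void reader(__global const float *restrict src, int n) {
    for (int i = 0; i < n; i++) {
        // Fully unrolled so every channel index is a compile-time constant.
        #pragma unroll
        for (int u = 0; u < NUM_CU; u++) {
            write_channel_intel(in_ch[u], src[i * NUM_CU + u]);
        }
    }
}

// Compute kernel: single work-item, autorun, replicated NUM_CU times.
// Autorun kernels take no arguments and start without a host enqueue.
__attribute__((max_global_work_dim(0)))
__attribute__((autorun))
__attribute__((num_compute_units(NUM_CU)))
__kernel void compute(void) {
    // Each replica gets its own ID, resolved at compile time.
    const int cid = get_compute_id(0);
    while (1) {
        float x = read_channel_intel(in_ch[cid]);
        float y = x * x + 1.0f;  // placeholder for the real computation
        write_channel_intel(out_ch[cid], y);
    }
}

// Memory-write kernel: the only kernel that stores to global memory.
__attribute__((max_global_work_dim(0)))
__kernel void writer(__global float *restrict dst, int n) {
    for (int i = 0; i < n; i++) {
        #pragma unroll
        for (int u = 0; u < NUM_CU; u++) {
            dst[i * NUM_CU + u] = read_channel_intel(out_ch[u]);
        }
    }
}
```

With this layout the host only enqueues reader and writer (on separate queues so they run concurrently); the compute replicas are always running. Since the replicas contain no global memory interface, increasing NUM_CU should mostly duplicate the compute datapath, which is where the low replication overhead comes from.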