Forum Discussion

Altera_Forum
Honored Contributor
8 years ago

NDRange Kernels' Global Memory Write Pattern

I have come up with some basic questions that I would like to discuss:

1) I still don't understand how SIMD is implemented on an FPGA. On a GPU, work-items assigned to SIMD lanes execute concurrently. Is it the same on an FPGA, or are they just interleaved? For example, with simd 16, are 16 work-items scheduled to a compute unit and executed in an interleaved fashion?

2) In case SIMD on an FPGA is not as parallel as on a GPU, are 16 work-items scheduled to a compute unit at once, with everything waiting for them to finish? Or can the next work-items still come in and be pushed into the pipeline?

3) Imagine a case where every work-item writes only one value to global memory at the end of its execution, at the index equal to that work-item's global ID. With many compute units and a SIMD width of 16, many write operations with non-contiguous addresses would be issued each clock cycle (based on my understanding). This seems inefficient with regard to high-performance memory access. Does that mean kernels designed for GPUs are not suitable for FPGAs, with regard to their memory access pattern?
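
To make question 3 concrete, the pattern described would look something like this hypothetical kernel (the names and the per-item computation are illustrative, not taken from any real code in this thread):

```
// Hypothetical OpenCL kernel sketching the pattern in question 3: each
// work-item computes one value and writes it to the index equal to its
// own global ID. With a SIMD width of 16 and multiple compute units,
// many such writes are issued per cycle.
__kernel void one_write_per_item(__global float *restrict out)
{
    size_t gid = get_global_id(0);
    float result = (float)gid * 2.0f;  // stand-in for the real per-item work
    out[gid] = result;                 // single write, addressed by global ID
}
```
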

4) Does the LSU (Load Store Unit) perform memory coalescing? In other words, does it have any kind of buffer that receives memory write operations and then flushes them to memory after grouping them into contiguous blocks of data?
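
As a mental model for question 4 (this is only a toy model, not how the actual LSU hardware is implemented), write coalescing can be pictured as buffering pending writes and merging those with contiguous addresses into bursts:

```python
# Toy model of write coalescing (illustrative only -- NOT the actual
# Intel FPGA OpenCL LSU implementation): buffer incoming (address, value)
# writes, then flush them as bursts of contiguous addresses.

def coalesce_writes(writes):
    """Group (address, value) pairs into bursts of contiguous addresses."""
    bursts = []
    for addr, val in sorted(writes):
        # The next contiguous address after a burst is start + length.
        if bursts and addr == bursts[-1][0] + len(bursts[-1][1]):
            bursts[-1][1].append(val)      # extend the current burst
        else:
            bursts.append((addr, [val]))   # start a new burst
    return bursts

# 16 SIMD lanes writing consecutive addresses merge into a single burst:
print(coalesce_writes([(i, i * 2) for i in range(16)]))
# Writes from scattered compute units stay as separate transactions:
print(coalesce_writes([(0, 0), (100, 1), (200, 2)]))
```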

16 Replies

  • Altera_Forum

    M, N, and P are all constants, and MF is basically the same as M. You're doing the same math on the same inputs for all work-items. Your calculations don't depend on the work-item number, the group number, or anything else, so tempout should always be the same for all work-items.
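
    Since the original kernel is not quoted in this thread, the following is only a guess at its general shape: every value used in the computation is a compile-time constant, so tempout comes out identical for all work-items and only the final store depends on the work-item ID.

```
// Hypothetical reconstruction (the real kernel is not shown in this
// thread): M, N, and P stand in for the constants mentioned above.
#define M 64
#define N 32
#define P 16

__kernel void constant_compute(__global float *restrict out)
{
    float tempout = 0.0f;
    for (int i = 0; i < N; i++)
        tempout += (float)(M * P + i);   // independent of the work-item ID
    out[get_global_id(0)] = tempout;     // only this line uses the ID
}
```
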

  • Altera_Forum

    My assumption was that the compiler cannot optimize my kernel aggressively. Now, if you claim the compiler is smart enough to recognize the constant behaviour of my kernel, what would be its effect on further consideration of the SIMD and CU factors, and on optimization in general?

  • Altera_Forum

    There are most likely no further optimizations possible. Does the optimization report say anything? I'd be surprised if it did.

  • Altera_Forum

    I don't see any evidence of optimization in the report. Considering all this, is vectorization still happening in the kernel? I still believe the compiler cannot optimize away kernels like the one I've provided; it can only optimize the logic of the code itself.

  • Altera_Forum

    SIMD vectorization applies to the data passed into the kernel; it only benefits performance when your input data can be vectorized.

    In your code, only the constant M is passed in, and it can't be vectorized. I would guess that's why the resource usage is the same.

    If your goal is parallel execution like on a GPU, you should experiment with the compute unit settings, but it's still not quite the same as a GPU in some respects.

    Bottom line: you can launch kernels separately, under different kernel names and different queues; that way they definitely run in parallel :p

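
    For reference, compute-unit replication and SIMD vectorization are requested through kernel attributes in the Intel FPGA OpenCL SDK. A sketch (the work-group size and the factors below are arbitrary examples; num_simd_work_items must evenly divide the required work-group size):

```
// Sketch of the replication/vectorization attributes (sizes are
// arbitrary examples, not a recommendation).
__attribute__((reqd_work_group_size(64, 1, 1)))
__attribute__((num_simd_work_items(16)))   // widen the datapath 16x
__attribute__((num_compute_units(2)))      // replicate the whole pipeline 2x
__kernel void vec_kernel(__global float *restrict out)
{
    out[get_global_id(0)] = 1.0f;
}
```
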
  • Altera_Forum

    The compiler does NOT optimize out the computation (it actually isn't smart enough to do that), and it is indeed vectorizing your kernel. However, the compiler can easily tell that your computation does not depend on the work-item ID, and hence it does NOT vectorize the computation; it does vectorize the write to memory, which depends on the work-item ID. In this case the compiler creates the logic in such a way that the computation is done only once, but the result is copied back SIMD times (in a single coalesced write).

    The reason you do not see any difference in logic utilization is that the difference is so small that it does not show up in the "percentage" values. If you check the actual numbers in the HTML report, there are small differences in LUT and FF utilization. The difference is, of course, caused only by the line with the memory write; the resource utilization for the rest of the lines is exactly the same. Furthermore, you can clearly see in the system viewer that the write port gets wider.