Altera_Forum
Honored Contributor
8 years agoNDRnage Kernels Global Memory Write Pattern
I came up with some basic questions, which I would like to discuss:
1) I still don't understand how SIMD is being implemented in FPGA. In GPU, workitems being assigned to SIMDs are being executed concurrently. Is it the same case for FPGA? Or they are just going to be interleaved? For example in case of simd 16, Does 16 workitems being scheduled to a compute unit and being executed in interleaved fashion? 2) In case SIMD in FPGA is not as parallel as GPU, does 16 workitems being scheduled at once to a compute unit and wait to finish? or the next work items can still come in and be pushed into the pipeline? 3) Imagine a case where every work-item only writes one value at the end of execution into the global memory, and it writes it to index of "globalid" of that work item. In case of having many compute units and having SIMD of 16, at each clock cycle many write operations will be issued with non-continuous addresses (Based on my understanding). This seems to be inefficient with regards to high performance memory access. Does that mean, kernels designed for GPU are not suitable for FPGA, with regards to their memory access pattern? 4) Does LSU (Load Store Unit) performs memory coalescing? In other words, does it have any kind of buffer to receive memory write operations, and then flush them into the memory after grouping them into multiple continuous blocks of data?