Forum Discussion

Altera_Forum
Honored Contributor
8 years ago

NDRange Kernels' Global Memory Write Pattern

I have come up with some basic questions that I would like to discuss:

1) I still don't understand how SIMD is implemented on an FPGA. On a GPU, work-items assigned to SIMD lanes execute concurrently. Is it the same on an FPGA, or are they just interleaved? For example, with simd 16, are 16 work-items scheduled to a compute unit and executed in an interleaved fashion?

2) In case SIMD on an FPGA is not as parallel as on a GPU, are 16 work-items scheduled to a compute unit at once, with everything waiting for them to finish? Or can the next work-items still come in and be pushed into the pipeline?

3) Imagine a case where every work-item writes only one value to global memory at the end of its execution, at the index equal to that work-item's global ID. With many compute units and a SIMD width of 16, many write operations with non-contiguous addresses would be issued each clock cycle (based on my understanding). This seems inefficient with regard to high-performance memory access. Does that mean kernels designed for GPUs are not suitable for FPGAs, with regard to their memory access pattern?
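
To make question 3 concrete, the pattern described would look something like this hypothetical kernel (the names and the per-item computation are illustrative, not taken from any real code in this thread):

```
// Hypothetical OpenCL kernel sketching the pattern in question 3: each
// work-item computes one value and writes it to the index equal to its
// own global ID. With a SIMD width of 16 and multiple compute units,
// many such writes are issued per cycle.
__kernel void one_write_per_item(__global float *restrict out)
{
    size_t gid = get_global_id(0);
    float result = (float)gid * 2.0f;  // stand-in for the real per-item work
    out[gid] = result;                 // single write, addressed by global ID
}
```
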

4) Does the LSU (Load Store Unit) perform memory coalescing? In other words, does it have any kind of buffer that receives memory write operations and then flushes them to memory after grouping them into contiguous blocks of data?
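
As a mental model for question 4 (this is only a toy model, not how the actual LSU hardware is implemented), write coalescing can be pictured as buffering pending writes and merging those with contiguous addresses into bursts:

```python
# Toy model of write coalescing (illustrative only -- NOT the actual
# Intel FPGA OpenCL LSU implementation): buffer incoming (address, value)
# writes, then flush them as bursts of contiguous addresses.

def coalesce_writes(writes):
    """Group (address, value) pairs into bursts of contiguous addresses."""
    bursts = []
    for addr, val in sorted(writes):
        # The next contiguous address after a burst is start + length.
        if bursts and addr == bursts[-1][0] + len(bursts[-1][1]):
            bursts[-1][1].append(val)      # extend the current burst
        else:
            bursts.append((addr, [val]))   # start a new burst
    return bursts

# 16 SIMD lanes writing consecutive addresses merge into a single burst:
print(coalesce_writes([(i, i * 2) for i in range(16)]))
# Writes from scattered compute units stay as separate transactions:
print(coalesce_writes([(0, 0), (100, 1), (200, 2)]))
```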

16 Replies

  • Altera_Forum

    M, N, and P are all constants, and MF is basically the same as M. You're doing the same math on the same inputs for all work-items. Your calculations don't depend on the work-item number, the group number, or anything else, so tempout should always be the same for all work-items.
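
    Since the original kernel is not quoted in this thread, the following is only a guess at its general shape: every value used in the computation is a compile-time constant, so tempout comes out identical for all work-items and only the final store depends on the work-item ID.

```
// Hypothetical reconstruction (the real kernel is not shown in this
// thread): M, N, and P stand in for the constants mentioned above.
#define M 64
#define N 32
#define P 16

__kernel void constant_compute(__global float *restrict out)
{
    float tempout = 0.0f;
    for (int i = 0; i < N; i++)
        tempout += (float)(M * P + i);   // independent of the work-item ID
    out[get_global_id(0)] = tempout;     // only this line uses the ID
}
```
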

  • Altera_Forum

    My assumption was that the compiler cannot optimize my kernel aggressively. Now, if you claim the compiler is smart enough to recognize the constant behaviour of my kernel, what would be its effect on further consideration of the SIMD and CU factors, and on optimization in general?

  • Altera_Forum

    There are most likely no further optimizations possible. Does the optimization report say anything? I'd be surprised if it did.

  • Altera_Forum

    I don't see any evidence of optimization in the report. Considering all this, is vectorization still happening in the kernel? I still believe the compiler cannot optimize away kernels like the one I've provided; it can only optimize the logic of the code itself.

  • Altera_Forum

    SIMD vectorization applies to the data passed into the kernel; it only benefits performance when your input data can be vectorized.

    In your code, only the constant M is passed in, and it can't be vectorized. I would guess that's why the resource usage is the same.

    If your goal is parallel execution like on a GPU, you should experiment with the compute unit settings, but it's still not quite the same as a GPU in some respects.

    Bottom line: you can launch kernels separately, under different kernel names and different queues; that way they definitely run in parallel :p

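
    For reference, compute-unit replication and SIMD vectorization are requested through kernel attributes in the Intel FPGA OpenCL SDK. A sketch (the work-group size and the factors below are arbitrary examples; num_simd_work_items must evenly divide the required work-group size):

```
// Sketch of the replication/vectorization attributes (sizes are
// arbitrary examples, not a recommendation).
__attribute__((reqd_work_group_size(64, 1, 1)))
__attribute__((num_simd_work_items(16)))   // widen the datapath 16x
__attribute__((num_compute_units(2)))      // replicate the whole pipeline 2x
__kernel void vec_kernel(__global float *restrict out)
{
    out[get_global_id(0)] = 1.0f;
}
```
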
  • Altera_Forum

    The compiler does NOT optimize out the computation (it actually isn't smart enough to do that), and it is indeed vectorizing your kernel. However, the compiler can easily tell that your computation does not depend on the work-item ID, and hence it does NOT vectorize the computation; it does vectorize the write to memory, which depends on the work-item ID. In this case the compiler creates the logic in such a way that the computation is done only once, but the result is copied back SIMD times (in a single coalesced write).

    The reason you do not see any difference in logic utilization is that the difference is so small that it does not show up in the "percentage" values. If you check the actual numbers in the HTML report, there are small differences in LUT and FF utilization. The difference is, of course, caused only by the line with the memory write; the resource utilization for the rest of the lines is exactly the same. Furthermore, you can clearly see in the system viewer that the write port gets wider.