I came up with some basic questions, which I would like to discuss: 1) I still don't understand how SIMD is being implemented in FPGA. In GPU, workitems being assigned to SIMDs are being executed concurrently. Is it the same case for FPGA? Or they are just going to be interleaved? For example in case of simd 16, Does 16 workitems being scheduled to a compute unit and being executed in interleaved fashion? 2) In case SIMD in FPGA is not as parallel as GPU, does 16 workitems being scheduled at once to a compute unit and wait to finish? or the next work items can still come in and be pushed into the pipeline? 3) Imagine a case where every work-item only writes one value at the end of execution into the global memory, and it writes it to index of "globalid" of that work item. In case of having many compute units and having SIMD of 16, at each clock cycle many write operations will be issued with non-continuous addresses (Based on my understanding). This seems to be inefficient with regards to high performance memory access. Does that mean, kernels designed for GPU are not suitable for FPGA, with regards to their memory access pattern? 4) Does LSU (Load Store Unit) performs memory coalescing? In other words, does it have any kind of buffer to receive memory write operations, and then flush them into the memory after grouping them into multiple continuous blocks of data?

1) On the FPGA, by default, a deep pipeline is created so work-items can go in and come out every clock cycle. Optionally, the single pipeline can be vectored to bring more work-items in simultaneously or the entire pipeline can be duplicated to handle different workgroups simultaneously. 2) Depends on the implementation. Without vectorization, a work item goes in and comes out every clock cycle. With vectorization, all 16 can be processed in parallel. The tradeoff is always performance vs. FPGA resource use. 3) You would not want to do this. Better to use a barrier to synchronize all the work-items and write all the work-item data to global memory in one shot. 4) Yes. The compiler will coalesce memory accesses where it can. If it can't, the optimization report will indicate what implementation was selected and why coalescing could not be performed.

Thanks much sstrell, Here some more questions I have: 1) If the value of number of SIMD, really duplicates the pipelines in the compute units, then there should be a significant difference in resource usage between SIMD 8 and SIMD16. Looking at the resource usage, it tells me increasing SIMD value does not necessarily increases the amount of resource usage. Does it mean SIMD pipeline replication is efficient in terms of area growth? 3) Can you elaborate more on this? Isn't barrier introduces severe performance penalty though? 4) So you are telling there is no runtime mechanism for memory coalescing, and it's all compile time. Is that true? Cause I think for GPU is the other way.

1) No. You should try vectorization (num_simd_work_items) first before CU replication (num_compute_units). Both use more resources, but num_simd_work_items will use less. 3) The penalty of syncing work-items before performing a memory access is much less than constant calls to global. Again, check the optimization report and use the profiler to see the affects on your design. 4) All pipeline hardware is created with the offline compile so the choice of load/store units is done at that point as well, including whether coalescing can be performed or not.

1) Alright, I totally understand the difference between num_simd_work_items and num_compute_units. What I don't understand is, how SIMDs are being implemented to achieve parallelism and low resource consumption. By low I mean really really low. I barely see increase or decrease in area by playing with the value of num_simd_work_items. That's why I came up with the conclusion that num_compute_units achieves real parallelism and num_simd_work_items just interleaves work item one after the other. 2) Can you elaborate more on Barrier implementation? I believe even after all workitems in the workgroup hit the barrier, then they should do their write operations one after the other. I doubt after hitting barrier all 256 workitems in the workgroup can execute their write instruction in the OpenCL code.

1) With vectorization, all the control logic is shared, minimizing the extra resources required, so more work-items go into the pipeline in parallel. It is still pipelined, so if you vectorize to, say, 16, you can still put in 16 work-items each clock cycle and get 16 work-items out each clock cycle. If the pipeline is not vectorized, work-items go in the pipeline one at a time, but you get a constant throughput of 1 work-item per clock cycle assuming no stalls in the pipeline. 2) No, the barrier works just as you say. All work-items pause at the barrier, usually done before a memory operation. The hardware is implemented to handle this. Remember, you're creating FPGA hardware, which is completely customized on the way you write the code, so if you use a barrier and then perform a memory operation, LSUs are selected to handle this in the most efficient way possible (coalescence). Thus, in most cases, you want to also specify a required or maximum number of work-items since this tells the compiler how to organize the LSUs to be most efficient.

NDRnage Kernels Global Memory Write Pattern

16 Replies

Altera_Forum
Honored Contributor
8 years ago
1) On the FPGA, by default, a deep pipeline is created so work-items can go in and come out every clock cycle. Optionally, the single pipeline can be vectored to bring more work-items in simultaneously or the entire pipeline can be duplicated to handle different workgroups simultaneously.

2) Depends on the implementation. Without vectorization, a work item goes in and comes out every clock cycle. With vectorization, all 16 can be processed in parallel. The tradeoff is always performance vs. FPGA resource use.

3) You would not want to do this. Better to use a barrier to synchronize all the work-items and write all the work-item data to global memory in one shot.

4) Yes. The compiler will coalesce memory accesses where it can. If it can't, the optimization report will indicate what implementation was selected and why coalescing could not be performed.
Altera_Forum
Honored Contributor
8 years ago
Thanks much sstrell,

Here some more questions I have:

1) If the value of number of SIMD, really duplicates the pipelines in the compute units, then there should be a significant difference in resource usage between SIMD 8 and SIMD16. Looking at the resource usage, it tells me increasing SIMD value does not necessarily increases the amount of resource usage. Does it mean SIMD pipeline replication is efficient in terms of area growth?

3) Can you elaborate more on this? Isn't barrier introduces severe performance penalty though?

4) So you are telling there is no runtime mechanism for memory coalescing, and it's all compile time. Is that true? Cause I think for GPU is the other way.
Altera_Forum
Honored Contributor
8 years ago
1) No. You should try vectorization (num_simd_work_items) first before CU replication (num_compute_units). Both use more resources, but num_simd_work_items will use less.

3) The penalty of syncing work-items before performing a memory access is much less than constant calls to global. Again, check the optimization report and use the profiler to see the affects on your design.

4) All pipeline hardware is created with the offline compile so the choice of load/store units is done at that point as well, including whether coalescing can be performed or not.
Altera_Forum
Honored Contributor
8 years ago
1) Alright, I totally understand the difference between num_simd_work_items and num_compute_units. What I don't understand is, how SIMDs are being implemented to achieve parallelism and low resource consumption. By low I mean really really low. I barely see increase or decrease in area by playing with the value of num_simd_work_items. That's why I came up with the conclusion that num_compute_units achieves real parallelism and num_simd_work_items just interleaves work item one after the other.

2) Can you elaborate more on Barrier implementation? I believe even after all workitems in the workgroup hit the barrier, then they should do their write operations one after the other. I doubt after hitting barrier all 256 workitems in the workgroup can execute their write instruction in the OpenCL code.
Altera_Forum
Honored Contributor
8 years ago
1) With vectorization, all the control logic is shared, minimizing the extra resources required, so more work-items go into the pipeline in parallel. It is still pipelined, so if you vectorize to, say, 16, you can still put in 16 work-items each clock cycle and get 16 work-items out each clock cycle. If the pipeline is not vectorized, work-items go in the pipeline one at a time, but you get a constant throughput of 1 work-item per clock cycle assuming no stalls in the pipeline.

2) No, the barrier works just as you say. All work-items pause at the barrier, usually done before a memory operation. The hardware is implemented to handle this. Remember, you're creating FPGA hardware, which is completely customized on the way you write the code, so if you use a barrier and then perform a memory operation, LSUs are selected to handle this in the most efficient way possible (coalescence). Thus, in most cases, you want to also specify a required or maximum number of work-items since this tells the compiler how to organize the LSUs to be most efficient.
Altera_Forum
Honored Contributor
8 years ago
Hi,

I have done some compilation on the same kernel and just played with the SIMD number (1, 2, 4, 8, 16) and watched over the resource utilization percentage. Here is the result I've got for every compilation:

16 SIMD:
+---------------------------------------------------------------------+
; Estimated Resource Usage Summary ;
+----------------------------------------+---------------------------+
; Resource + Usage ;
+----------------------------------------+---------------------------+
; Logic utilization ; 33% ;
; ALUTs ; 13% ;
; Dedicated logic registers ; 20% ;
; Memory blocks ; 12% ;
; DSP blocks ; 6% ;
+----------------------------------------+---------------------------;

8 SIMD:
+---------------------------------------------------------------------+
; Estimated Resource Usage Summary ;
+----------------------------------------+---------------------------+
; Resource + Usage ;
+----------------------------------------+---------------------------+
; Logic utilization ; 33% ;
; ALUTs ; 13% ;
; Dedicated logic registers ; 20% ;
; Memory blocks ; 12% ;
; DSP blocks ; 6% ;
+----------------------------------------+---------------------------;

4 SIMD:
+---------------------------------------------------------------------+
; Estimated Resource Usage Summary ;
+----------------------------------------+---------------------------+
; Resource + Usage ;
+----------------------------------------+---------------------------+
; Logic utilization ; 33% ;
; ALUTs ; 13% ;
; Dedicated logic registers ; 20% ;
; Memory blocks ; 12% ;
; DSP blocks ; 6% ;
+----------------------------------------+---------------------------;

2 SIMD:
+---------------------------------------------------------------------+
; Estimated Resource Usage Summary ;
+----------------------------------------+---------------------------+
; Resource + Usage ;
+----------------------------------------+---------------------------+
; Logic utilization ; 33% ;
; ALUTs ; 13% ;
; Dedicated logic registers ; 20% ;
; Memory blocks ; 12% ;
; DSP blocks ; 6% ;
+----------------------------------------+---------------------------;

1 SIMD:
+---------------------------------------------------------------------+
; Estimated Resource Usage Summary ;
+----------------------------------------+---------------------------+
; Resource + Usage ;
+----------------------------------------+---------------------------+
; Logic utilization ; 33% ;
; ALUTs ; 13% ;
; Dedicated logic registers ; 20% ;
; Memory blocks ; 12% ;
; DSP blocks ; 6% ;
+----------------------------------------+---------------------------;

I still don't see any difference in utilization. Can you please elaborate more how different size for SIMD changes the area usage? I still feel like it's not clear how things are being mapped deep in the FPGA.
Altera_Forum
Honored Contributor
8 years ago
You should post your kernel code; what you are observing is very likely because you have something in your code that is preventing vectorization (e.g. work-item ID-dependent branching) and hence, SIMD is not being applied at all. If this is indeed the case, you will see a message form the compiler about this in the log.

Altera_Forum

Honored Contributor

8 years ago

Here is the kernel I compile:

__attribute__((num_compute_units(5)))
__attribute__((num_simd_work_items(16)))
__attribute__((reqd_work_group_size(256,1,1)))
__kernel void WGSXMAPIXLLXOPS16(const __global float * restrict GIn, __global float * restrict GOut, const float M, const float N, const float P) {
    const int XGL = get_global_id(0);
    const int XGRid = get_group_id(0);
    const int XGRnum = get_num_groups(0);
    const int XLSize = get_local_size(0);
    const int XLid = get_local_id(0);
    // Just a private variable
    float MF = (float) M;
    float NF = (float) N;
    float PF = (float) P;
    // Start of a new level of for loop
    long baseIndex1 = XGRid*XLSize*2+XLid;
    float temp1 = 1.0;
    float temp2 = 1.0;
    float temp3 = 1.0;
    float temp4 = 1.0;
    float tempOut;
    temp1 += temp1 * MF;
    temp1 += temp1 * MF;
    temp1 += temp1 * MF;
    temp1 += temp1 * MF;
    temp1 += temp1 * MF;
    temp1 += temp1 * MF;
    temp1 += temp1 * MF;
    temp1 += temp1 * MF;
    temp1 += temp1 * MF;
    temp1 += temp1 * MF;
    temp1 += temp1 * MF;
    temp1 += temp1 * MF;
    temp1 += temp1 * MF;
    temp1 += temp1 * MF;
    temp1 += temp1 * MF;
    temp1 += temp1 * MF;
    tempOut = temp1 + temp2 + temp3 + temp4;
    GOut = tempOut;
}

Basically I use it for some performance measurement and I made sure no optimization is going to happen.

Altera_Forum
Honored Contributor
8 years ago
You're not even using GIn. You're just doing math on constant values. tempout is the same value for all work-items. Am I missing something here?
Altera_Forum
Honored Contributor
8 years ago
Yes I'm not reading anything from memory, since I want to role out memory effect as much as possible. But I don't understand why tempout is the same for all! it's a private variable in the kernel. Beside, none of the operations perform on constant values. As a result compiler cannot optimize them out. I have checked the verilog code to make sure all fmas are out there.

Forum Discussion

NDRnage Kernels Global Memory Write Pattern

16 Replies

Recent Discussions

Quartus 13.1 including Signal Tap License

how to reduce clock skew between synchronous clock

I do not get an eMail with the generated license file

Installer cannot establish connection with SSL error

Quartus 20.1std compilation fails for Quartus map - Device 10AS057K2F40I1SG