Hi All,

I am using oneAPI to implement an application on an Arria 10 GX acceleration card for my research work. There is a long kernel pipeline, and the input and output memory locations should be swapped on each iteration. Initially the read and write loops were separate kernels, but that way I can't synchronise the memory reads and writes across multiple iterations. Hence I merged the read and write into one nested loop as follows.

        [[intel::max_concurrency(1)]]
        for(int itr = 0; itr < 2*n_iter; itr++){
          // Swap the read and write buffers on every outer iteration.
          accessor ptrR1 = (itr & 1) == 0 ? in1 : out1;
          accessor ptrW1 = (itr & 1) == 1 ? in1 : out1;

          auto input_ptr = ptrR1.get_pointer();
          auto output_ptr = ptrW1.get_pointer();

          [[intel::initiation_interval(1)]]
          [[intel::ivdep]]
          [[intel::max_concurrency(0)]]
          for(int i = 0; i < total_itr; i++){
            // Feed the kernel pipeline from memory.
            vec1 = ptrR1[i];
            pipeS::PipeAt<idx1>::write(vec1);

            // Collect the processed data and write it back.
            vecW1 = pipeS::PipeAt<idx2>::read();
            ptrW1[i] = vecW1;
          }
        }

This works, but I am getting reduced performance: around 8 times less bandwidth than expected. The same inner loop without pipes, just copying data to the write location, gives the expected performance. Any suggestions or advice to fix the performance issue are appreciated.

Many Thanks,
Vasan
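The parity-based buffer swap described in the post can be illustrated with a small host-side model. This is only a sketch in plain C++, not SYCL or FPGA code: `process` and `run_passes` are hypothetical names, and `process` stands in for whatever the kernel pipeline between the two pipes computes.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the kernel pipeline between the two pipes:
// here it simply adds 1 to every element.
inline int process(int x) { return x + 1; }

// Runs `passes` passes over two equally sized buffers, swapping the read and
// write roles on every pass based on the parity of the pass index.
inline void run_passes(std::vector<int>& in1, std::vector<int>& out1, int passes) {
  assert(in1.size() == out1.size());
  for (int itr = 0; itr < passes; ++itr) {
    // Even pass: read in1, write out1. Odd pass: read out1, write in1.
    const bool even = (itr & 1) == 0;
    const std::vector<int>& src = even ? in1 : out1;
    std::vector<int>& dst = even ? out1 : in1;
    for (std::size_t i = 0; i < src.size(); ++i) {
      dst[i] = process(src[i]);
    }
  }
}
```

With an even number of passes (as with `2*n_iter` in the post), the final result ends up back in the first buffer, and the second buffer holds the result of the previous pass.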

Hey Vasan,

Can you share the reports for both versions of your code? (A report-only compile is enough; no Quartus compile run is needed.) That would help identify why you are seeing a throughput drop. If there is no difference between the reports, it may be that your pipe operations are blocking (trying to read from an empty pipe or write to a full pipe).

Yohann
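The two blocking conditions mentioned above (reading from an empty pipe, writing to a full pipe) can be made concrete with a tiny software model of a fixed-capacity FIFO. This is an illustrative sketch only: `BoundedFifo`, `try_write`, and `try_read` are hypothetical names and are not the SYCL pipe API. The points where they return `false` are the points where a blocking pipe operation would stall.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>

// Software model of a fixed-capacity FIFO (hypothetical, not the SYCL pipe API).
template <typename T>
class BoundedFifo {
 public:
  explicit BoundedFifo(std::size_t capacity) : capacity_(capacity) {}

  // Returns false when the FIFO is full; a blocking write would stall here.
  bool try_write(const T& value) {
    if (items_.size() >= capacity_) return false;
    items_.push_back(value);
    return true;
  }

  // Returns false when the FIFO is empty; a blocking read would stall here.
  bool try_read(T& out) {
    if (items_.empty()) return false;
    out = items_.front();
    items_.pop_front();
    return true;
  }

 private:
  std::size_t capacity_;
  std::deque<T> items_;
};
```

In this model, a producer that runs ahead of the consumer by more than the capacity hits the "full" case, and a consumer that runs ahead of the producer hits the "empty" case.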

Hi Yohann,

Thanks for the reply. I did a few experiments, and it seems the loop above is mapped to a stall-enabled cluster. There is a latency between the read data passing through the processing kernel and returning on the pipe for the memory write. During this latency the entire cluster stops on each iteration. When I modify the processing kernel so that it just pops the data and pushes random data to the write pipe, I get the expected performance.

Is there a way to make the memory read cluster and the memory write cluster stall-free, as in the following?

https://www.intel.com/content/www/us/en/develop/documentation/oneapi-fpga-optimization-guide/top/introduction-to-fpga-design-concepts/scheduling/clustering-the-datapath.html

Kind regards,
Vasan

From what I can see in this code snippet, your inner loop reads from and writes to DDR and also performs blocking pipe operations. It therefore needs to be in a stall-enabled cluster, as both the DDR and the pipes may stall your kernel. If you want a stall-free compute loop, you'll need to remove both the DDR accesses and the pipe operations from it.

If I understand your issue correctly, what stalls your compute kernel is the memory accesses and not the pipe operations? In that case you may want to copy the data you want to compute on into a local memory, have your compute kernel work on that local memory, and produce its results into another local memory. The results local memory can then be copied back to DDR.

Hi Yuguen,

Thanks for the reply. I agree that the DDR read should be stall-enabled and the DDR write should also be stall-enabled. But I want independent DDR read and write clusters for the inner loop, since there is no dependency between them. It seems that DDR read, pipe write, pipe read, and DDR write all end up in the same stall-enabled cluster. Is there any way to put DDR read and pipe write in one stall-enabled cluster, and pipe read and DDR write in another? The DDR read shouldn't stall the DDR write.

Basically, what I want is this: a chunk of data needs to be read (its size could be larger than the on-chip memory), processed (there is a kernel pipeline), and written back, and all of these should happen in parallel. On the next iteration, the read and write locations should be swapped. The code above tries to implement the read and the write-back of the results in an iterative loop that swaps the memory locations. Implementing the memory read and memory write in separate kernels doesn't allow swapping, as a buffer should be used in only one kernel in non-USM designs. When I tried this, the design hung.

Is there any way to implement this? Any suggestions or advice are highly appreciated.

Many Thanks,
Vasan

If I understand your problem correctly, I would loop over:

1/ a loop reading a part of DDR and storing the data to two local memories for both of your accessors
2/ computing on these local memories
3/ a loop writing the two local memories back to DDR

If you have enough private copies of the local memories, the compiler will schedule 1/, 2/ and 3/ in parallel. So while you are computing 2/, another part of the DDR is being read and the previously computed local memory is being written to DDR. With these local memories, you should never stall because of DDR (assuming your kernel is compute bound).
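The three steps above can be sketched as a host-side model in plain C++. This is a sequential illustration under stated assumptions, not SYCL or FPGA code: `kChunk`, `compute`, `staged_pass`, and `run_iterations` are hypothetical names, and the vectors stand in for DDR buffers. Any overlap of the three stages would come from the compiler's scheduling on the device, which this model does not attempt to show.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

constexpr std::size_t kChunk = 4;  // hypothetical local-memory size, in elements

inline int compute(int x) { return 2 * x; }  // hypothetical compute stage

// One pass over `src`, writing results to `dst`, staged through local buffers.
inline void staged_pass(const std::vector<int>& src, std::vector<int>& dst) {
  assert(src.size() == dst.size());
  int local_in[kChunk];
  int local_out[kChunk];
  for (std::size_t base = 0; base < src.size(); base += kChunk) {
    const std::size_t n = std::min(kChunk, src.size() - base);
    // 1/ load a chunk from "DDR" into local memory
    for (std::size_t i = 0; i < n; ++i) local_in[i] = src[base + i];
    // 2/ compute on local memory only
    for (std::size_t i = 0; i < n; ++i) local_out[i] = compute(local_in[i]);
    // 3/ write the results back to "DDR"
    for (std::size_t i = 0; i < n; ++i) dst[base + i] = local_out[i];
  }
}

// Repeats the staged pass, swapping source and destination after each pass.
inline void run_iterations(std::vector<int>& a, std::vector<int>& b, int iters) {
  std::vector<int>* src = &a;
  std::vector<int>* dst = &b;
  for (int it = 0; it < iters; ++it) {
    staged_pass(*src, *dst);
    std::swap(src, dst);
  }
}
```

Because the compute step touches only the local arrays, the data set can be larger than the local memory: it is processed chunk by chunk, and the last chunk may be shorter than `kChunk`.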

kkvasan

New Contributor

4 years ago

OneAPi: Iterative read write with swapped memory locations


