User Profile

kkvasan

New Contributor

Joined 4 years ago

14 Posts

Contributions

Re: oneAPI: Iterative read write with swapped memory locations
Hi @BoonBengT_Altera, Have a good day to you too! Yes, my doubts have been clarified and I am now able to implement the target design. Kind Regards, Vasan
4 years ago Place Acceleration
2.1K Views
0 Likes
0 Comments
Re: oneAPI: Iterative read write with swapped memory locations
Hi Yohann, Yes, using pipes for kernel-to-kernel communication makes life easier. It seems that tweaking the code to use a little extra global memory helps to avoid stalls on the pipe read and write.
    template <int idx1, int idx2>
    int g_read_write(const rAcc &ptrR1, const wAcc &ptrW1, int total_itr, int delay) {
        [[intel::ivdep]]
        [[intel::initiation_interval(1)]]
        for (int i = 0; i < total_itr + delay; i++) {
            // Feed the pipeline for the first total_itr iterations. The payload
            // starts `delay` elements into the buffer. The load is guarded so the
            // last `delay` iterations do not read past the end of the buffer.
            if (i < total_itr) {
                struct dPath16 vec1 = ptrR1[i + delay];
                pipeS::PipeAt<idx1>::write(vec1);
            }
            // Start reading results only after `delay` iterations.
            struct dPath16 vecW1;
            if (i >= delay) {
                vecW1 = pipeS::PipeAt<idx2>::read();
            }
            // The first `delay` writes land in the padding at the start of the buffer.
            ptrW1[i] = vecW1;
        }
        return 0;
    }
With a sufficient pipe depth and delay value, we can avoid stalling due to the pipe read and write. It costs delay*sizeof(dPath16) bytes of additional global memory at the beginning of the buffer. This function can be called inside the iterative loop. Many Thanks, Vasan
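The delayed read/write idea in this post can be illustrated with a small host-side model. This is only a sketch in plain C++, not SYCL: a `std::deque` stands in for the two pipes and the kernels between them, `latency` is a hypothetical number of loop iterations a value needs to get from the write pipe to the read pipe, and the "processing" is a placeholder that adds 1. The names `SkewResult`, `skewed_read_write` and `late_reads` are made up for illustration.

```cpp
#include <cassert>
#include <deque>
#include <utility>
#include <vector>

// Result of the software model below.
struct SkewResult {
    std::vector<int> out;  // output buffer; payload starts at index `delay`
    int late_reads;        // pipe reads issued before their data was ready
};

// Host-side model only (not SYCL). `latency` is an assumed pipeline latency,
// measured in loop iterations; the processing step is a placeholder (+1).
SkewResult skewed_read_write(const std::vector<int>& in, int total_itr,
                             int delay, int latency) {
    assert(delay >= 0 && latency >= 0);
    assert(static_cast<int>(in.size()) == total_itr + delay);
    std::deque<std::pair<int, int>> in_flight;  // (iteration when ready, value)
    SkewResult r{std::vector<int>(total_itr + delay, 0), 0};

    for (int i = 0; i < total_itr + delay; ++i) {
        if (i < total_itr) {
            // Pipe write: the payload starts `delay` elements into the buffer.
            in_flight.emplace_back(i + latency, in[i + delay] + 1);
        }
        if (i >= delay) {
            // Pipe read: on hardware this would block if the data were late.
            assert(!in_flight.empty());
            if (in_flight.front().first > i) ++r.late_reads;
            r.out[i] = in_flight.front().second;
            in_flight.pop_front();
        }
    }
    return r;
}
```

In this model, no read is issued early as long as `delay` is at least the assumed `latency`; with a smaller `delay`, every read is counted as late, which corresponds to the per-iteration stall described earlier in the thread. The real latency of a kernel pipeline has to be taken from the compiler report or measured.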
4 years ago Place Acceleration
2.2K Views
0 Likes
0 Comments
Re: oneAPI: Iterative read write with swapped memory locations
Hi Yuguen, Thanks for the advice. I need to find a way to transfer data between kernels through local memory, as there is a big compute kernel pipeline consisting of tens of kernels. I will try your suggestion! Many Thanks, Vasan
4 years ago Place Acceleration
2.2K Views
0 Likes
0 Comments
Re: oneAPI: Iterative read write with swapped memory locations
Hi Yuguen, Thanks for the reply. I agree that the DDR read and the DDR write should each be stall-enabled. But I want independent DDR read and write clusters for the inner loop, since there is no dependency between them. It seems that the DDR read, pipe write, pipe read, and DDR write all fall under the same stall-enabled cluster. Is there any way to make the DDR read and pipe write one stall-enabled cluster, and the pipe read and DDR write another? The DDR read shouldn’t stall the DDR write.
Basically, what I want is this: a chunk of data needs to be read (its size could be larger than the on-chip memory), processed (there is a kernel pipeline), and written back, with all of these happening in parallel. On the next iteration, the read and write locations should be swapped. The code above tries to implement the read and write-back of the results in an iterative loop that swaps the memory locations. Implementing the memory read and memory write in separate kernels doesn't allow swapping, as a buffer should be used by only one kernel in non-USM designs. When I tried this, the design hung. Is there any way to implement this? Any suggestions or advice are highly appreciated. Many Thanks, Vasan
4 years ago Place Acceleration
2.2K Views
0 Likes
0 Comments
Re: oneAPI: Iterative read write with swapped memory locations
Hi Yohann, Thanks for the reply. I did a few experiments, and it seems that the loop above is mapped to a stall-enabled cluster. There is a latency between the read data going through the processing kernel and returning to the pipe for the memory write. During this latency, the entire cluster stops on each iteration. When I modify the processing kernel so that it just pops the data and pushes random data to the write pipe, I get the expected performance. Is there a way to make the memory read cluster and the memory write cluster stall-free, as described in the following? https://www.intel.com/content/www/us/en/develop/documentation/oneapi-fpga-optimization-guide/top/introduction-to-fpga-design-concepts/scheduling/clustering-the-datapath.html Kind regards, Vasan
4 years ago Place Acceleration
2.2K Views
0 Likes
0 Comments
oneAPI: Iterative read write with swapped memory locations
Hi All, I am using oneAPI to implement an application on an Arria 10 GX acceleration card for my research work. There is a long kernel pipeline, and the input and output memory locations should be swapped on each iteration. Initially the read and write loops were separate kernels, but that way I couldn't synchronise the memory read and write across multiple iterations. Hence I merged the read and write into one nested loop, as follows.
    [[intel::max_concurrency(1)]]
    for (int itr = 0; itr < 2 * n_iter; itr++) {
        // Swap the read and write buffers on every iteration.
        accessor ptrR1 = (itr & 1) == 0 ? in1 : out1;
        accessor ptrW1 = (itr & 1) == 1 ? in1 : out1;
        auto input_ptr = ptrR1.get_pointer();
        auto output_ptr = ptrW1.get_pointer();
        [[intel::initiation_interval(1)]]
        [[intel::ivdep]]
        [[intel::max_concurrency(0)]]
        for (int i = 0; i < total_itr; i++) {
            vec1 = ptrR1[i];
            pipeS::PipeAt<idx1>::write(vec1);
            vecW1 = pipeS::PipeAt<idx2>::read();
            ptrW1[i] = vecW1;
        }
    }
This works, but I am getting reduced performance: around 8 times less bandwidth than expected. The same inner loop without pipes, just copying data to the write location, gives the expected performance. Any suggestions or advice to fix the performance issue are appreciated. Many Thanks, Vasan
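The buffer swapping described in this post can be sketched on the host in plain C++. This is an illustration only, not the FPGA design: `std::vector` stands in for the two global-memory buffers, and `process` is a hypothetical placeholder for the kernel pipeline (here it just adds 1). The function name `ping_pong` is also made up.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the kernel pipeline: adds 1 to each element.
inline int process(int x) { return x + 1; }

// Runs `passes` passes over the data. Each pass reads one buffer and writes
// the other; the roles are swapped by the parity of the pass index.
// Returns a reference to the buffer that holds the final result.
std::vector<int>& ping_pong(std::vector<int>& a, std::vector<int>& b, int passes) {
    assert(a.size() == b.size());
    for (int itr = 0; itr < passes; ++itr) {
        std::vector<int>& src = (itr & 1) == 0 ? a : b;  // read side
        std::vector<int>& dst = (itr & 1) == 0 ? b : a;  // write side
        for (std::size_t i = 0; i < src.size(); ++i) {
            dst[i] = process(src[i]);
        }
    }
    // After an even number of passes the last write went to `a`, else to `b`.
    return (passes & 1) == 0 ? a : b;
}
```

Selecting the source and destination by parity means that no data is copied between passes; only the roles of the two buffers change, which is the same selection the `(itr & 1)` expressions make on the accessors in the post.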
4 years ago Place Acceleration
2.3K Views
0 Likes
11 Comments
Re: Stratix 10 oneAPI: Kernel CLK vs clock 2x
Thank you so much @HRZ for the detailed information. It would be better if we could double-pump the DSPs as well. Vasan
4 years ago Place Acceleration
1.4K Views
0 Likes
1 Comment
Re: Stratix 10 oneAPI nodes: Error enumerating resources
Hi BB, Thanks for the reply. The pac_s10 board wasn't detected on the Stratix nodes last Friday, but it is now visible on those nodes. Many Thanks, Vasan
4 years ago Place Acceleration
1.2K Views
0 Likes
0 Comments
Stratix 10 oneAPI: Kernel CLK vs clock 2x
Hi All, I am new to Intel FPGAs and am learning about performance optimizations using oneAPI. I read about the HyperFlex routing optimisation for Stratix 10 FPGAs. It says we can get 2x core performance, as it helps the design operate at a higher frequency. I compiled my design using dpcpp, targeting the pac_s10_usm acceleration board, and the report lists two clocks: the kernel clock and clock 2x, at 274 MHz and 548 MHz respectively. But when measuring the throughput, it seems the design operates at the kernel clock of 274 MHz. Is it possible to run part of the design with clock 2x using oneAPI? Many Thanks, Vasan
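One way to check which clock a datapath is effectively running at is to compare the measured throughput with the peak each clock would allow. The arithmetic is sketched below; the 64 bytes per cycle is purely an assumed datapath width for illustration (the post does not state one), and `peak_mb_per_s` is a made-up helper name.

```cpp
#include <cassert>

// Peak throughput in MB/s for a datapath that moves `bytes_per_cycle` bytes
// on every clock cycle: 1 MHz x 1 byte/cycle = 1 MB/s (10^6 bytes/s).
constexpr long long peak_mb_per_s(long long clock_mhz, long long bytes_per_cycle) {
    return clock_mhz * bytes_per_cycle;
}

// Assumed width of 64 bytes per cycle (hypothetical, not from the post).
static_assert(peak_mb_per_s(274, 64) == 17536, "kernel clock: about 17.5 GB/s");
static_assert(peak_mb_per_s(548, 64) == 35072, "clock 2x: about 35.1 GB/s");
```

A measured throughput close to the first figure and well below the second would be consistent with the datapath running at the kernel clock, as observed in the post.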
Solved
4 years ago Place Acceleration
1.5K Views
0 Likes
4 Comments
Stratix 10 oneAPI nodes: Error enumerating resources
Hi All, Today I am again unable to run applications on the Stratix 10 oneAPI nodes. I tried all the S10 oneAPI nodes, but every one gives a similar error, as follows:
    Error enumerating AFCs: not found
    Segmentation fault
I tried fpgainfo as well, and it also says:
    Error enumerating resources: not found
I also tried to initialize the board, but it wasn’t successful and gave a similar error. Any suggestions or advice to fix this error are appreciated. Many Thanks, Vasan
4 years ago Place Acceleration
1.2K Views
0 Likes
3 Comments