Data level parallelism on FPGA with kernel replication using oneAPI
Hi,
I'm playing with kernel replication on FPGA using oneAPI. There is a tutorial on kernel replication here, but it exploits pipeline parallelism, whereas I want to exploit data-level parallelism: say I have two identical kernels, one processing half of the data and the other processing the other half. Here is my attempt with a vector-add example, but I would like to ask for advice in case there is a better way to do the same:
#include <sycl/sycl.hpp>
#include <oneapi/tbb/cache_aligned_allocator.h>
#include <vector>
using namespace sycl;

using IntVector = std::vector<int, oneapi::tbb::cache_aligned_allocator<int>>;

// Forward declaration of the kernel name template.
template <bool HBM_enabled, int Replica, int unroll_factor> class VAdd;

template <bool HBM_enabled, int Replica, int NumRep, int unroll_factor>
sycl::event VectorAdd(queue &q, const IntVector &a_vector,
                      const IntVector &b_vector, IntVector &sum_parallel) {
  // Compute the block of the vectors that this replica processes.
  size_t num_items{a_vector.size()};
  int begin = Replica * num_items / NumRep;
  int end = (Replica + 1) * num_items / NumRep;

  // Create buffers that hold the data shared between the host and the devices.
  // The buffer destructor is responsible for copying the data back to the host
  // when it goes out of scope.
  buffer a_buf{a_vector.begin() + begin, a_vector.begin() + end};
  buffer b_buf{b_vector.begin() + begin, b_vector.begin() + end};
  buffer sum_buf{sum_parallel.begin() + begin, sum_parallel.begin() + end};
  sum_buf.set_final_data(sum_parallel.begin() + begin);
  sum_buf.set_write_back();

  // Submit a command group to the queue by a lambda function that contains the
  // data access permission and device computation (kernel).
  auto e = q.submit([&](handler &h) {
    if constexpr (HBM_enabled) {
      // Bind each accessor to its own HBM bank.
      ext::oneapi::accessor_property_list PL0{ext::intel::buffer_location<Replica * 3>};
      ext::oneapi::accessor_property_list PL1{ext::intel::buffer_location<Replica * 3 + 1>};
      ext::oneapi::accessor_property_list PL2{no_init, ext::intel::buffer_location<Replica * 3 + 2>};
      accessor a{a_buf, h, read_only, PL0};
      accessor b{b_buf, h, read_only, PL1};
      accessor sum{sum_buf, h, write_only, PL2};
      h.single_task<VAdd<HBM_enabled, Replica, unroll_factor>>(
          [=]() [[intel::kernel_args_restrict]] {
            #pragma unroll unroll_factor
            for (size_t i = 0; i < end - begin; i++)
              sum[i] = a[i] + b[i];
          });
    } else {
      accessor a{a_buf, h, read_only};
      accessor b{b_buf, h, read_only};
      // The sum accessor is used to store (with write permission) the sum data.
      accessor sum{sum_buf, h, write_only, no_init};
      h.single_task<VAdd<HBM_enabled, Replica, unroll_factor>>(
          [=]() [[intel::kernel_args_restrict]] {
            #pragma unroll unroll_factor
            for (size_t i = 0; i < end - begin; i++)
              sum[i] = a[i] + b[i];
          });
    }
  });
  return e;
}
Then I can create two replicas in the main() function with:
auto e0 = VectorAdd<true,0,2,4>(q, a, b, sum_parallel);
auto e1 = VectorAdd<true,1,2,4>(q, a, b, sum_parallel);
q.wait();
Where "q" is the FPGA queue, and I'm doing sum_parallel = a + b (of type IntVector).
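As a sanity check on the splitting scheme: the begin/end arithmetic above divides the vector into contiguous blocks. A small host-only sketch (block_begin, block_end and check_partition are names I made up; no SYCL needed) verifies that the blocks tile the whole index range with no gap and no overlap, even when the size is not divisible by the number of replicas:

```cpp
#include <cstddef>

// Same block arithmetic as in VectorAdd: replica r of num_rep processes
// the half-open range [r*n/num_rep, (r+1)*n/num_rep).
constexpr std::size_t block_begin(std::size_t r, std::size_t n, std::size_t num_rep) {
  return r * n / num_rep;
}
constexpr std::size_t block_end(std::size_t r, std::size_t n, std::size_t num_rep) {
  return (r + 1) * n / num_rep;
}

// Walks the blocks in order and checks that each one starts exactly where
// the previous one ended, and that the last one ends at n.
bool check_partition(std::size_t n, std::size_t num_rep) {
  std::size_t covered = 0;
  for (std::size_t r = 0; r < num_rep; r++) {
    if (block_begin(r, n, num_rep) != covered) return false; // gap or overlap
    covered = block_end(r, n, num_rep);
  }
  return covered == n; // every element is assigned to exactly one replica
}
```

The partition is gap-free because block r ends and block r+1 begins at the same value, (r+1)*n/num_rep.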
The issue I found is that this comment:
// The buffer destructor is responsible for copying the data back to the host
// when it goes out of scope.
was true if I initialize an output buffer with the whole vector:
buffer sum_buf{sum_parallel};
but not if I initialize the buffers of each kernel replica with a region/block of the original vector of ints:
buffer sum_buf{sum_parallel.begin()+begin, sum_parallel.begin()+end};
In the latter case, the original output vector, sum_parallel, is not updated after the kernels are done (every position holds a 0). I found that adding these two lines solves the problem:
sum_buf.set_final_data(sum_parallel.begin()+begin); sum_buf.set_write_back();
and I imagine that they are required because the runtime cannot be sure whether the sum_buf blocks of the different replicas overlap in the original vector, so it conservatively disables the copy-back on buffer destruction. However, I couldn't find any mention of this in the oneAPI guides, the DPC++ book, or the code samples, so I wanted to share it with you in case you know a better solution or have a piece of advice.
Thank you very much in advance,
Rafa.
Hi asenjo,
Sorry for the late reply; I consulted one of my team members about your question. In your code as written, the buffers go out of scope at the end of the VectorAdd() function, and each buffer destructor blocks until its kernel finishes and the data is copied back, so the two kernels get serialized instead of running in parallel. Creating the buffers in main(), so that they outlive both calls, lets the kernels overlap.
The main() would look something like this:
buffer a_buf1{a_vector.begin()+begin1, a_vector.begin()+end1};
buffer b_buf1{b_vector.begin()+begin1, b_vector.begin()+end1};
buffer sum_buf1{sum_parallel.begin()+begin1, sum_parallel.begin()+end1};

buffer a_buf2{a_vector.begin()+begin2, a_vector.begin()+end2};
buffer b_buf2{b_vector.begin()+begin2, b_vector.begin()+end2};
buffer sum_buf2{sum_parallel.begin()+begin2, sum_parallel.begin()+end2};

auto e0 = VectorAdd<true,0,2,4>(q, a_buf1, b_buf1, sum_buf1);
auto e1 = VectorAdd<true,1,2,4>(q, a_buf2, b_buf2, sum_buf2);
q.wait();

(VectorAdd() would then take the buffers as parameters instead of the vectors.)
Another option is to use sub-buffers.
Another option is to use USM, but then the user is responsible for copying the data back and forth themselves:
https://www.intel.com/content/www/us/en/developer/articles/code-sample/vector-add.html
Thanks.
Regards,
Aik Eu