Re: parallel_for very slow in dpc++
Thank you so much @BoonBengT_Altera! Zipping the files using Python notebooks worked. I don't have any additional inquiries. Thank you!

Re: Weird behavior of dpc++ code after running it on FPGA device
Hello Aik Eu, Thank you for your reply. I removed no_init from the accessor and it is now working correctly, as expected. This line:
    accessor dif(res_buf, h, write_only, no_init);
became this:
    accessor dif(res_buf, h, write_only);
Thank you!

Re: parallel_for very slow in dpc++
Thank you for your support, Yohann! Finally, one last thing: it is not possible to view the report on Intel Devcloud, because installing a browser such as Firefox is not allowed, and viewing it through Jupyter Notebook on the cloud gives an empty HTML file. How can I view the report? Thank you!

Re: invalid work group size error, dpc++ code running on Intel Arria 10 oneAPI on devcloud
Hello Aik Eu! I wanted speed efficiency, so I tried to split the 16,000 samples (each containing 5 features, double precision) into smaller chunks. But it didn't work. Thank you!

Re: Exception in running vector-add on FPGA device on arria10 devcloud using the command dpcpp -fintelfpga -Xshardware <source_file>.cpp
Solved the problem; make fpga didn't work on any node. The problem is probably on my side. Thank you!

Re: parallel_for very slow in dpc++
Hello @BoonBengT_Altera, I still have doubts regarding the hardware run on the FPGA. I don't understand why a small parallel operation may take milliseconds while the FPGA runs at more than 500 MHz. Even after I specify the clock rate flag, -Xsclock=500MHz, nothing changes: it still takes milliseconds, whereas the host code runs much faster. Isn't an FPGA meant for acceleration? And the FPGA code above is very basic. Also, is there a way to know how many clock cycles the kernel code took? Is it equivalent to the latency in the report?
The latency in the report is 343, but without units. What does 343 mean? Is it the number of clock cycles, for example?
2- The segmentation fault problem was solved; nothing is wrong with the hardware run except the amount of time it takes compared to normal C++ code.
3- In my experiment, single_task with a loop inside took almost the same time as parallel_for without a loop inside, which means iterative code takes as long as parallel code on the hardware. Is this normal? Thank you!

Sorting a vector on FPGA device using dpc++
Hello, I want to sort a vector using DPC++, in parallel, on an FPGA device. The merge sort example in the reference designs for DPC++ FPGA is very complex and I can't seem to understand it. I would like to write my own merge sort, but I don't know how to merge the values in accessors, for example. Also, is it possible to use built-in functions like swap() and max() inside an FPGA parallel_for? Thank you!

Weird behavior of dpc++ code after running it on FPGA device
Hello, I am using DPC++ to accelerate the kNN algorithm on an FPGA device. The following is the code I wrote for the Euclidean distance. The problem is that the fpga_emulation run works very well with no problems, while running it on FPGA hardware (Intel Arria 10 OneAPI) gives -nan for all values in the resulting buffer, which means something went wrong in the parallel_for loop. But I can't find anything wrong with it, and the emulation worked. I am using the Intel Devcloud platform.
    std::vector<double> distance_calculation_FPGA(queue& q, const std::vector<std::vector<double>>& dataset, const std::vector<double>& curr_test) {
        std::cout << "convert 2D to 1D" << std::endl;
        std::vector<double> linear_dataset;
        for (int i = 0; i < dataset.size(); ++i) {
            for (int j = 0; j < dataset[i].size(); ++j) {
                linear_dataset.push_back(dataset[i][j]);
            }
        }
        std::cout << "buffering" << std::endl;
        range<1> num_items{dataset.size()};
        std::vector<double> res;
        //std::cout << "im in" << std::endl;
        res.resize(dataset.size());
        buffer dataset_buf(linear_dataset);
        buffer curr_test_buf(curr_test);
        buffer res_buf(res.data(), num_items);
        std::cout << "submit a job" << std::endl;
        auto start = std::chrono::high_resolution_clock::now();
        {
            q.submit([&](handler& h) {
                accessor a(dataset_buf, h, read_only);
                accessor b(curr_test_buf, h, read_only);
                accessor dif(res_buf, h, write_only, no_init);
                h.parallel_for(num_items, [=](auto i) {
                    for (int j = 0; j < 5; ++j) {
                        dif[i] += (b[j] - a[i * 5 + j]) * (b[j] - a[i * 5 + j]);
                    }
                    // out << "i : " << i << " i[0]: " << i[0] << " b: " << b[0] << cl::sycl::endl;
                });
            }).wait();
        }
        auto finish = std::chrono::high_resolution_clock::now();
        std::chrono::duration<double> elapsed = finish - start;
        std::cout << "Elapsed time: " << elapsed.count() << " s\n";
        /*
        for (int i = 0; i < dataset.size(); ++i) {
            double dis = 0;
            for (int j = 0; j < dataset[i].size(); ++j) {
                dis += (curr_test[j] - dataset[i][j]) * (curr_test[j] - dataset[i][j]);
            }
            res.push_back(dis);
        }
        */
        return res;
    }

Results with fpga_emulation: ./knn.fpga_emu
Results for FPGA hardware: ./knn.fpga
Thank you so much!
Solved

Re: parallel_for very slow in dpc++
1- I know that my target is A10. I am asking about the commands that help me see the information related to a specific device or to devices in general, and the commands that help me specify the frequency and so on, but I couldn't find them.
2- The code finally worked with these dpcpp commands for the emulation and hardware runs:
    dpcpp -fintelfpga -DFPGA_EMULATOR knn_trial2.cpp -o knn.fpga_emu
and
    dpcpp -fintelfpga -Xshardware fpga_compile.cpp -o fpga_compile.fpga
The results now give an average of 0.0005 s, which is fine. It is still slower than the iterative code; might this be because of the overhead you mentioned? Yet it is way faster than the Python code, which runs in 0.012 s. The segmentation fault is caused by sorting: it works fine with emulation, and the segmentation fault occurs only after the FPGA run. Aren't the host and kernel code separate even with an FPGA run?
3- I have a question regarding parallel_for, single_task, and work groups: doesn't parallel_for mean that all the elements (from 0 to num_size) do the same job at the same time, i.e. run in parallel? I have searched for the difference between parallel_for, single_task, and work groups, and didn't find a satisfying explanation for any of them. Does parallel_for cause all elements to run in one operation, in almost one clock cycle? Thank you!

invalid work group size error, dpc++ code running on Intel Arria 10 oneAPI on devcloud
Hello, I am using Devcloud to run my DPC++ code on FPGA hardware for acceleration. I am using a node that runs Arria 10 OneAPI. I was able to run the fpga_emu file and the results were as expected. When I use FPGA hardware it gives this error:
    Caught a SYCL host exception: Non-uniform work-groups are not supported by the target device -54 (CL_INVALID_WORK_GROUP_SIZE)
    terminate called after throwing an instance of 'cl::sycl::nd_range_error'
    what(): Non-uniform work-groups are not supported by the target device -54 (CL_INVALID_WORK_GROUP_SIZE)
    Aborted
I don't see any problem with the sizes of the work groups.
    range<1> num_items{dataset.size()};
    res.resize(dataset.size());
    buffer dataset_buf(linear_dataset);
    buffer curr_test_buf(curr_test);
    buffer res_buf(res.data(), num_items);
    std::cout << "submit a job" << std::endl;
    //auto start = std::chrono::high_resolution_clock::now();
    {
        q.submit([&](handler& h) {
            accessor a(dataset_buf, h, read_only);
            accessor b(curr_test_buf, h, read_only);
            accessor dif(res_buf, h, read_write, no_init);
            h.parallel_for_work_group(range<1>(32), range<1>(500), [=](group<1> g) {
                g.parallel_for_work_item([&](h_item<1> item) {
                    int i = item.get_global_id(0);
                    for (int j = 0; j < 5; ++j) {
                        dif[i] += (b[j] - a[i * 5 + j]) * (b[j] - a[i * 5 + j]);
                    }
                    // out << "i : " << i << " i[0]: " << i[0] << " b: " << b[0] << cl::sycl::endl;
                });
            });
        }).wait();
    }

I previously used a normal parallel_for like this, and it took a huge amount of time to run on FPGA hardware, which actually accelerated nothing. That's why I thought of work groups:

    range<1> num_items{dataset.size()};
    std::vector<double> res;
    res.resize(dataset.size());
    buffer dataset_buf(linear_dataset);
    buffer curr_test_buf(curr_test);
    buffer res_buf(res.data(), num_items);
    std::cout << "submit a job" << std::endl;
    //auto start = std::chrono::high_resolution_clock::now();
    {
        q.submit([&](handler& h) {
            accessor a(dataset_buf, h, read_only);
            accessor b(curr_test_buf, h, read_only);
            accessor dif(res_buf, h, read_write, no_init);
            h.parallel_for(num_items, [=](auto i) {
                // dif[i] = a[i].size() * 1.0;// a[i];
                for (int j = 0; j < 5; ++j) {
                    dif[i] += (b[j] - a[i * 5 + j]) * (b[j] - a[i * 5 + j]);
                }
                // out << "i : " << i << " i[0]: " << i[0] << " b: " << b[0] << cl::sycl::endl;
            });
        }).wait();
    }

Thanks a lot!
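
A side note on the distance kernels in these posts: they accumulate with dif[i] += into an accessor created with no_init, and one of the replies in this feed reports that removing no_init fixed the -nan results. With no_init, the initial contents of the buffer are not guaranteed to reach the device, so += may read undefined values. One way to avoid depending on the output's initial contents at all is to sum into a local variable and store the result once. The following is only a minimal host-side sketch of that pattern in plain C++ (no SYCL, not run on an FPGA). The function name squared_distances is hypothetical, and it assumes the same row-major flattened layout as linear_dataset above.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not from the original posts): squared Euclidean
// distance from `query` to every row of a row-major flattened dataset.
// Assumes each row has exactly query.size() features.
std::vector<double> squared_distances(const std::vector<double>& flat,
                                      const std::vector<double>& query) {
    const std::size_t num_features = query.size();
    assert(num_features > 0 && flat.size() % num_features == 0);
    const std::size_t num_rows = flat.size() / num_features;

    std::vector<double> result(num_rows);
    for (std::size_t i = 0; i < num_rows; ++i) {
        double acc = 0.0;  // local accumulator, explicitly zeroed
        for (std::size_t j = 0; j < num_features; ++j) {
            const double d = query[j] - flat[i * num_features + j];
            acc += d * d;
        }
        result[i] = acc;  // single store; the old value of result[i] is never read
    }
    return result;
}
```

In a kernel shaped like the ones above, the same idea would be a double acc = 0.0; inside the lambda followed by a single dif[i] = acc;, so that the output element is written but never read. That transfer is an assumption and has not been verified on the Arria 10 hardware discussed here.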