Forum Discussion

amaltaha's avatar
amaltaha
Icon for Occasional Contributor rankOccasional Contributor
4 years ago

parallel_for very slow in dpc++

Hello,

I really need help with this.
I am trying to accelerate an algorithm using DPC++. what happens is that the normal calculations takes 1.5 times faster than kernel parallel execution. The following code is for both calculations:

the num_items currently equals 16,000, I tried small values like 500 but the same thing, the CPU is way faster the kernel.

// This is the normal CPU calculations
std::vector<double> distance_calculation(std::vector<std::vector<double>>& dataset, std::vector<double>& curr_test) {
    auto start = std::chrono::high_resolution_clock::now();
    std::vector<double>res;
    for (int i = 0; i < dataset.size(); ++i) {
        double dis = 0;
        for (int j = 0; j < dataset[i].size(); ++j) {
            dis += (curr_test[j] - dataset[i][j]) * (curr_test[j] - dataset[i][j]);
        }
        res.push_back(dis);
    }
    auto finish = std::chrono::high_resolution_clock::now();
    std::chrono::duration<double> elapsed = finish - start;
    std::cout << "Elapsed time: " << elapsed.count() << " s\n";
    return res;
}
// This is for FPGA emulation.
std::vector<double> distance_calculation_FPGA(queue& q, const std::vector<std::vector<double>>& dataset, const std::vector<double>& curr_test) {
    std::vector<double>linear_dataset;
    for (int i = 0; i < dataset.size(); ++i) {
        for (int j = 0; j < dataset[i].size(); ++j) {
            linear_dataset.push_back(dataset[i][j]);
        }
    }
    range<1> num_items{dataset.size()};
    std::vector<double>res;
    //std::cout << "im in" << std::endl;

    res.resize(dataset.size());
    buffer dataset_buf(linear_dataset);
    buffer curr_test_buf(curr_test);
    buffer res_buf(res.data(), num_items);
    {
        auto start = std::chrono::high_resolution_clock::now();
        q.submit([&](handler& h) {
            accessor a(dataset_buf, h, read_only);
            accessor b(curr_test_buf, h, read_only);

            accessor dif(res_buf, h, read_write, no_init);
            h.parallel_for(range<1>(num_items), [=](id<1> i) {
                //  dif[i] = a[i].size() * 1.0;// a[i];
                for (int j = 0; j < 5; ++j) {
                    dif[i] += (b[j] - a[i * 5 + j]) * (b[j] - a[i * 5 + j]);

                }
                });
            });
            q.wait();
            auto finish = std::chrono::high_resolution_clock::now();
            std::chrono::duration<double> elapsed = finish - start;
            std::cout << "Elapsed time: " << elapsed.count() << " s\n";

    }
    /*
        for (int i = 0; i < dataset.size(); ++i) {
            double dis = 0;
            for (int j = 0; j < dataset[i].size(); ++j) {
                dis += (curr_test[j] - dataset[i][j]) * (curr_test[j] - dataset[i][j]);
            }
            res.push_back(dis);
        }
        */
    return res;
}

17 Replies

  • Hi @amaltaha,


    Great! Good to know that you are able to get it worked, with no further clarification on this thread, it will be transitioned to community support for further help on doubts in this thread.

    Thank you for the questions and as always pleasure having you here.


    Best Wishes

    BB