Forum Discussion
Hi,
There is some overhead when using an accelerator, so you have to make the array size much larger to see the advantage of parallel operation.
In the file vector_add.cpp: Change the following, rebuild and rerun the example.
static const size_t ARRAY_SIZE = 10000;
to
static const size_t ARRAY_SIZE = 10000000;
and let us know the results
Regards
Anil
Hi,
When I try to increase the array size as suggested, I get a segfault on the DevCloud, and a stack-overflow exception on my computer with the Intel oneAPI Base Toolkit (Windows, FPGA Emulator).
// Problem size for this example
//constexpr size_t array_size = 10000;
constexpr size_t array_size = 100000000;
I could not find a way to increase the stack size, so I decided to use a vector instead of an array.
// Define the ARRAY type for use in this example
// typedef std::array<cl::sycl::cl_int, array_size> IntArray;
typedef std::vector<cl::sycl::cl_int> IntArray;
and create the vectors using the v(size, val) constructor:
IntArray addend_1(array_size, 0), addend_2(array_size, 0), sum_scalar(array_size, 0), sum_parallel(array_size, 0);
I tried the same size first and checked that my result vector is filled as expected. Then I increased array_size to 100M, and this is the result:
########################################################################
# Date: Mon Mar 30 06:10:14 PDT 2020
# Job ID: 560373.v-qsvr-1.aidevcloud
# User: u38134
# Resources: neednodes=1:gpu:ppn=2,nodes=1:gpu:ppn=2,walltime=06:00:00
########################################################################
:: setvars has already been run. Skipping any further invocation. To force its re-execution, pass --force
./vector-add
Vector Size = 100000000
CPU Freq (GHz)= 4.000
Scalar execution time = 0.063809 seconds
Device: Intel(R) Gen9 HD Graphics NEO
Parallel execution time = 1.711407 seconds
success
########################################################################
# End of output for job 560373.v-qsvr-1.aidevcloud
# Date: Mon Mar 30 06:10:20 PDT 2020
########################################################################
The code blocks I measured here:
RTM_START();
// Add arrays in scalar and in parallel
add_arrays_scalar(sum_scalar, addend_1, addend_2);
RTM_STOP();
printf("Scalar execution time = %2.6lf seconds\n", secs);
RTM_START();
add_arrays_parallel(sum_parallel, addend_1, addend_2);
RTM_STOP();
printf("Parallel execution time = %2.6lf seconds\n", secs);
Finally, I measured it again in parts, and this is the result:
########################################################################
# Date: Mon Mar 30 06:45:23 PDT 2020
# Job ID: 560414.v-qsvr-1.aidevcloud
# User: u38134
# Resources: neednodes=1:gpu:ppn=2,nodes=1:gpu:ppn=2,walltime=06:00:00
########################################################################
:: setvars has already been run. Skipping any further invocation. To force its re-execution, pass --force
./vector-add
Vector Size = 100000000
CPU Freq (GHz)= 4.000
Scalar execution time = 0.063832 seconds
Device: Intel(R) Gen9 HD Graphics NEO
Init device queue = 1.273701 seconds
Buffer creation = 0.000002 seconds
Queue submission + Accessors + parallel execution time = 0.504512 seconds
Q wait_and_throw() execution time = 0.063734 seconds
Parallel execution time = 3.893053 seconds
success
########################################################################
# End of output for job 560414.v-qsvr-1.aidevcloud
# Date: Mon Mar 30 06:45:29 PDT 2020
########################################################################
The blocks I measured above, inside the add_arrays_parallel() function:
RTM_START();
std::unique_ptr<queue> q = initialize_device_queue();
RTM_STOP();
printf("Init device queue = %2.6lf seconds\n", secs);
RTM_START();
// The range of the arrays managed by the buffer
range<1> num_items{ array_size };
// Buffers are used to tell DPC++ which data will be shared between the host
// and the devices because they usually don't share physical memory
// The pointer that's being passed as the first parameter transfers ownership
// of the data to DPC++ at runtime. The destructor is called when the buffer
// goes out of scope and the data is given back to the std::arrays.
// The second parameter specifies the range given to the buffer.
buffer<cl_int, 1> addend_1_buf(addend_1.data(), num_items);
buffer<cl_int, 1> addend_2_buf(addend_2.data(), num_items);
buffer<cl_int, 1> sum_buf(sum.data(), num_items);
RTM_STOP();
printf("Buffer creation = %2.6lf seconds\n", secs);
RTM_START();
// queue::submit takes in a lambda that is passed in a command group handler
// constructed at runtime. The lambda also contains a command group, which
// contains the device-side operation and its dependencies
q->submit([&](handler &h) {
// Accessors are the only way to get access to the memory owned
// by the buffers initialized above. The first get_access template parameter
// specifies the access mode for the memory and the second template
// parameter is the type of memory to access the data from; this parameter
// has a default value
auto addend_1_accessor = addend_1_buf.template get_access<dp_read>(h);
auto addend_2_accessor = addend_2_buf.template get_access<dp_read>(h);
// Note: Can use access::mode::discard_write instead of access::mode::write
// because we're replacing the contents of the entire buffer.
auto sum_accessor = sum_buf.template get_access<dp_write>(h);
// Use parallel_for to run array addition in parallel. This executes the
// kernel. The first parameter is the number of work items to use and the
// second is the kernel, a lambda that specifies what to do per work item.
// The template parameter ArrayAdd is used to name the kernel at runtime.
// The parameter passed to the lambda is the work item id of the current
// item.
//
// To remove the requirement to specify the kernel name you can enable
// unnamed lambda kernels with the option:
// dpcpp -fsycl-unnamed-lambda
h.parallel_for<class ArrayAdd>(num_items, [=](id<1> i) {
sum_accessor[i] = addend_1_accessor[i] + addend_2_accessor[i];
});
});
RTM_STOP();
printf("Queue submission + Accessors + parallel execution time = %2.6lf seconds\n", secs);
RTM_START();
// call wait_and_throw to catch async exception
q->wait_and_throw();
RTM_STOP();
printf("Q wait_and_throw() execution time = %2.6lf seconds\n", secs);
There is still a big difference between the scalar block (0.063832 seconds) and the queue submission + accessors + parallel execution block (0.504512 seconds).
The parallel block seems nearly 8 times slower than the scalar one (0.504512 s vs 0.063832 s). Did I miss something here?
- GNL
In addition to this, I found these resources about SYCL profiling events:
https://codeplay.com/portal/08-27-19-optimizing-your-sycl-code-using-profiling
https://gist.github.com/GeorgeWeb/ff908516bfe57f107bc36822dbdfe145
void profile(event_list& eventList, const time_point_list& startTimeList) {
  if (startTimeList.size() != eventList.size()) {
    std::string errMsg =
        "The number of events do not match the number of starting time points.";
    throw std::runtime_error("Profiling Error:\n" + errMsg);
  }
  T cgSubmissionTime = 0;
  T kernExecutionTime = 0;
  T realExecutionTime = 0;
  const auto eventCount = eventList.size();
  for (size_t i = 0; i < eventCount; ++i) {
    auto curEvent = eventList.at(i);
    curEvent.wait();
    auto curStartTime = startTimeList.at(i);
    const auto end = wall_clock_t::now();
    time_interval_t<T, std::milli> curRealExecutionTime = end - curStartTime;
    realExecutionTime += curRealExecutionTime.count();
    const auto cgSubmissionTimePoint =
        curEvent.template get_profiling_info<
            cl::sycl::info::event_profiling::command_submit>();
    const auto startKernExecutionTimePoint =
        curEvent.template get_profiling_info<
            cl::sycl::info::event_profiling::command_start>();
    const auto endKernExecutionTimePoint =
        curEvent.template get_profiling_info<
            cl::sycl::info::event_profiling::command_end>();
    cgSubmissionTime +=
        to_milli(startKernExecutionTimePoint - cgSubmissionTimePoint);
    kernExecutionTime +=
        to_milli(endKernExecutionTimePoint - startKernExecutionTimePoint);
  }
  set_command_group_submission_time(cgSubmissionTime);
  set_kernel_execution_time(kernExecutionTime);
  set_real_execution_time(realExecutionTime);
}
Then I applied the same steps to my vector addition code:
void VectorAddInDPCPP_2(const IntArray &VA, const IntArray &VB, IntArray &VC,
                        queue deviceQueue) {
  // print out the device information used for the kernel code
  std::cout << "Device: "
            << deviceQueue.get_device().get_info<info::device::name>()
            << std::endl;
  // set up profiling data containers
  using wall_clock_t = std::chrono::high_resolution_clock;
  using time_point_t = std::chrono::time_point<wall_clock_t>;
  const size_t profiling_iters = 1;
  std::vector<cl::sycl::event> eventList(profiling_iters);
  std::vector<time_point_t> startTimeList(profiling_iters);
  // create the range object for the arrays managed by the buffer
  range<1> num_items{ array_size };
  buffer<int, 1> bufferA(VA.data(), num_items);
  buffer<int, 1> bufferB(VB.data(), num_items);
  buffer<int, 1> bufferC(VC.data(), num_items);
  // Submit a kernel to the queue; each submit returns a SYCL event
  for (size_t i = 0; i < profiling_iters; ++i) {
    startTimeList.at(i) = wall_clock_t::now();
    eventList.at(i) = deviceQueue.submit([&](handler &cgh) {
      auto accessorA = bufferA.get_access<dp_read>(cgh);
      auto accessorB = bufferB.get_access<dp_read>(cgh);
      auto accessorC = bufferC.get_access<dp_write>(cgh);
      cgh.parallel_for(num_items, [=](id<1> j) {
        accessorC[j] = accessorA[j] + accessorB[j];
      });
    });
  }
  // exec profile
  example_profiler<double> my_profiler(eventList, startTimeList);
  std::cout << "Kernel exec: " << my_profiler.get_kernel_execution_time()
            << " msec" << std::endl;
  std::cout << "Cmd Group submission: "
            << my_profiler.get_command_group_submission_time() << " msec"
            << std::endl;
  std::cout << "Real exec: " << my_profiler.get_real_execution_time()
            << " msec" << std::endl;
}
Finally, this is the result for a 10K array, on the cloud:
########################################################################
# Date: Wed Apr 1 08:26:52 PDT 2020
# Job ID: 561887.v-qsvr-1.aidevcloud
# User: u38134
# Resources: neednodes=1:gpu:ppn=2,nodes=1:gpu:ppn=2,walltime=06:00:00
########################################################################
:: setvars has already been run. Skipping any further invocation. To force its re-execution, pass --force
./vector-add
Device: Intel(R) Gen9 HD Graphics NEO
Kernel exec: 0.150332 msec
Cmd Group submission: 2.66203 msec
Real exec: 167.013 msec
VectorAddInDPCPP exec: 232.501 msec
Scalar exec: 0.095 msec
success
########################################################################
# End of output for job 561887.v-qsvr-1.aidevcloud
# Date: Wed Apr 1 08:26:57 PDT 2020
########################################################################
As you can see above, the queue submission + kernel execution time is much less than the real execution time. What is the reason for waiting that long, and is there a way to improve or get rid of this waiting time?
Melih