Forum Discussion
GNL
New Contributor
5 years ago

Instead of measuring the add_arrays_parallel() function as a whole, I split it into chunks as shown below:
void add_arrays_parallel(IntArray &sum, const IntArray &addend_1, const IntArray &addend_2) {
    double start = 0.0, stop = 0.0, secs = 0.0;

    RTM_START();
    std::unique_ptr<queue> q = initialize_device_queue();
    RTM_STOP();
    printf("Init device queue = %2.6lf seconds\n", secs);

    RTM_START();
    // The range of the arrays managed by the buffers
    range<1> num_items{ array_size };
    // Buffers tell DPC++ which data will be shared between the host and the
    // devices, because they usually don't share physical memory.
    // The pointer passed as the first parameter transfers ownership of the
    // data to DPC++ at runtime. When the buffer goes out of scope, its
    // destructor runs and the data is given back to the std::arrays.
    // The second parameter specifies the range of the buffer.
    buffer<cl_int, 1> addend_1_buf(addend_1.data(), num_items);
    buffer<cl_int, 1> addend_2_buf(addend_2.data(), num_items);
    buffer<cl_int, 1> sum_buf(sum.data(), num_items);
    RTM_STOP();
    printf("Buffer creation = %2.6lf seconds\n", secs);

    RTM_START();
    // queue::submit takes a lambda that is passed a command group handler
    // constructed at runtime. The lambda defines a command group, which
    // contains the device-side operation and its dependencies.
    q->submit([&](handler &h) {
        // Accessors are the only way to get at the memory owned by the
        // buffers created above. The first get_access template parameter
        // specifies the access mode; the second (which has a default value)
        // is the type of memory to access the data from.
        auto addend_1_accessor = addend_1_buf.template get_access<dp_read>(h);
        auto addend_2_accessor = addend_2_buf.template get_access<dp_read>(h);
        // Note: access::mode::discard_write can be used instead of
        // access::mode::write because we replace the entire buffer's contents.
        auto sum_accessor = sum_buf.template get_access<dp_write>(h);
        // parallel_for runs the array addition in parallel: it executes the
        // kernel. The first argument is the number of work items and the
        // second is the kernel, a lambda specifying what to do per work item.
        // The template parameter ArrayAdd names the kernel at runtime; the
        // lambda's parameter is the id of the current work item.
        //
        // To remove the requirement to name the kernel, enable unnamed
        // lambda kernels with the option: dpcpp -fsycl-unnamed-lambda
        h.parallel_for<class ArrayAdd>(num_items, [=](id<1> i) {
            sum_accessor[i] = addend_1_accessor[i] + addend_2_accessor[i];
        });
    });
    RTM_STOP();
    printf("Queue submission + Accessors + parallel execution time = %2.6lf seconds\n", secs);

    RTM_START();
    // Call wait_and_throw to catch async exceptions
    q->wait_and_throw();
    RTM_STOP();
    printf("Q wait_and_throw() execution time = %2.6lf seconds\n", secs);
    // DPC++ will enqueue and run the kernel. Recall that the buffers' data is
    // given back to the host at the end of the function's scope.
}

When I build & run this for a SYCL host device, my results are:
########################################################################
# Date: Thu Mar 12 03:55:56 PDT 2020
# Job ID: 543019.v-qsvr-1.aidevcloud
# User: u38134
# Resources: neednodes=1:gpu:ppn=2,nodes=1:gpu:ppn=2,walltime=06:00:00
########################################################################
:: setvars has already been run. Skipping any further invocation. To force its re-execution, pass --force
./vector-add
CPU Freq = 939189000.00
Scalar execution time = 0.000086 seconds
Init device queue = 0.253235 seconds
Buffer creation = 0.000001 seconds
Queue submission + Accessors + parallel execution time = 0.000085 seconds
Q wait_and_throw() execution time = 0.000002 seconds
success
########################################################################
# End of output for job 543019.v-qsvr-1.aidevcloud
# Date: Thu Mar 12 03:55:57 PDT 2020
########################################################################

So I found out that the initialize_device_queue() function is the slowest one, and the "Queue submission + Accessors + parallel execution time" is almost the same as the scalar execution time.
Then I build & run this for a GPU device, and my results are:
########################################################################
# Date: Thu Mar 12 04:00:45 PDT 2020
# Job ID: 543025.v-qsvr-1.aidevcloud
# User: u38134
# Resources: neednodes=1:gpu:ppn=2,nodes=1:gpu:ppn=2,walltime=06:00:00
########################################################################
:: setvars has already been run. Skipping any further invocation. To force its re-execution, pass --force
./vector-add
CPU Freq = 1000062000.00
Scalar execution time = 0.000077 seconds
Device: Intel(R) Gen9 HD Graphics NEO
Init device queue = 0.422673 seconds
Buffer creation = 0.000004 seconds
Queue submission + Accessors + parallel execution time = 1.751712 seconds
Q wait_and_throw() execution time = 0.001086 seconds
success
########################################################################
# End of output for job 543025.v-qsvr-1.aidevcloud
# Date: Thu Mar 12 04:00:47 PDT 2020
########################################################################

and here the "Queue submission + Accessors + parallel execution time" is 1.75 seconds.
What takes so long?