Forum Discussion
Hi,
When I increase the array size as suggested, I get a segmentation fault on the DevCloud and a stack overflow exception on my own computer (Intel oneAPI Base Toolkit on Windows, FPGA emulator):
// Problem size for this example
//constexpr size_t array_size = 10000;
constexpr size_t array_size = 100000000;

I could not find a way to increase the stack size, so I decided to use a vector instead of an array:
// Define the ARRAY type for use in this example
// typedef std::array<cl::sycl::cl_int, array_size> IntArray;
typedef std::vector<cl::sycl::cl_int> IntArray;

and create the vectors using the v(size, val) constructor:
IntArray addend_1(array_size, 0), addend_2(array_size, 0), sum_scalar(array_size, 0), sum_parallel(array_size, 0);

I first tried the original 10K size and verified that the result vector is filled as expected. Then I increased array_size to 100M, and this is the result:
########################################################################
# Date: Mon Mar 30 06:10:14 PDT 2020
# Job ID: 560373.v-qsvr-1.aidevcloud
# User: u38134
# Resources: neednodes=1:gpu:ppn=2,nodes=1:gpu:ppn=2,walltime=06:00:00
########################################################################
:: setvars has already been run. Skipping any further invocation. To force its re-execution, pass --force
./vector-add
Vector Size = 100000000
CPU Freq (GHz)= 4.000
Scalar execution time = 0.063809 seconds
Device: Intel(R) Gen9 HD Graphics NEO
Parallel execution time = 1.711407 seconds
success
########################################################################
# End of output for job 560373.v-qsvr-1.aidevcloud
# Date: Mon Mar 30 06:10:20 PDT 2020
########################################################################

These are the code blocks I measured:
RTM_START();
// Add arrays in scalar
add_arrays_scalar(sum_scalar, addend_1, addend_2);
RTM_STOP();
printf("Scalar execution time = %2.6lf seconds\n", secs);

RTM_START();
// Add arrays in parallel
add_arrays_parallel(sum_parallel, addend_1, addend_2);
RTM_STOP();
printf("Parallel execution time = %2.6lf seconds\n", secs);

Finally, I measured the parallel path again, broken into smaller timed sections, and this is the result:
########################################################################
# Date: Mon Mar 30 06:45:23 PDT 2020
# Job ID: 560414.v-qsvr-1.aidevcloud
# User: u38134
# Resources: neednodes=1:gpu:ppn=2,nodes=1:gpu:ppn=2,walltime=06:00:00
########################################################################
:: setvars has already been run. Skipping any further invocation. To force its re-execution, pass --force
./vector-add
Vector Size = 100000000
CPU Freq (GHz)= 4.000
Scalar execution time = 0.063832 seconds
Device: Intel(R) Gen9 HD Graphics NEO
Init device queue = 1.273701 seconds
Buffer creation = 0.000002 seconds
Queue submission + Accessors + parallel execution time = 0.504512 seconds
Q wait_and_throw() execution time = 0.063734 seconds
Parallel execution time = 3.893053 seconds
success
########################################################################
# End of output for job 560414.v-qsvr-1.aidevcloud
# Date: Mon Mar 30 06:45:29 PDT 2020
########################################################################

The blocks measured above are inside the add_arrays_parallel() function:
RTM_START();
std::unique_ptr<queue> q = initialize_device_queue();
RTM_STOP();
printf("Init device queue = %2.6lf seconds\n", secs);

RTM_START();
// The range of the arrays managed by the buffer
range<1> num_items{ array_size };
// Buffers are used to tell DPC++ which data will be shared between the host
// and the devices, because they usually don't share physical memory.
// The pointer passed as the first parameter transfers ownership of the data
// to DPC++ at runtime. The destructor is called when the buffer goes out of
// scope, and the data is given back to the host containers.
// The second parameter specifies the range given to the buffer.
buffer<cl_int, 1> addend_1_buf(addend_1.data(), num_items);
buffer<cl_int, 1> addend_2_buf(addend_2.data(), num_items);
buffer<cl_int, 1> sum_buf(sum.data(), num_items);
RTM_STOP();
printf("Buffer creation = %2.6lf seconds\n", secs);

RTM_START();
// queue::submit takes a lambda that is passed a command group handler
// constructed at runtime. The lambda contains a command group, which holds
// the device-side operation and its dependencies.
q->submit([&](handler &h) {
  // Accessors are the only way to get access to the memory owned by the
  // buffers initialized above. The first get_access template parameter
  // specifies the access mode for the memory; the second is the type of
  // memory to access the data from and has a default value.
  auto addend_1_accessor = addend_1_buf.template get_access<dp_read>(h);
  auto addend_2_accessor = addend_2_buf.template get_access<dp_read>(h);
  // Note: access::mode::discard_write can be used instead of
  // access::mode::write because we're replacing the contents of the
  // entire buffer.
  auto sum_accessor = sum_buf.template get_access<dp_write>(h);
  // Use parallel_for to run the array addition in parallel, i.e. to execute
  // the kernel. The first parameter is the number of work items and the
  // second is the kernel: a lambda that specifies what to do per work item.
  // The template parameter ArrayAdd names the kernel at runtime, and the
  // parameter passed to the lambda is the work item id of the current item.
  //
  // To remove the requirement to specify the kernel name, you can enable
  // unnamed lambda kernels with the option:
  //   dpcpp -fsycl-unnamed-lambda
  h.parallel_for<class ArrayAdd>(num_items, [=](id<1> i) {
    sum_accessor[i] = addend_1_accessor[i] + addend_2_accessor[i];
  });
});
RTM_STOP();
printf("Queue submission + Accessors + parallel execution time = %2.6lf seconds\n", secs);

RTM_START();
// Call wait_and_throw to catch async exceptions
q->wait_and_throw();
RTM_STOP();
printf("Q wait_and_throw() execution time = %2.6lf seconds\n", secs);

There is still a big difference between the scalar block (0.063832 seconds) and the "Queue submission + Accessors + parallel execution" block (0.504512 seconds).
The parallel path seems roughly an order of magnitude slower than the scalar one. Did I miss something here?
In addition, I found these resources about SYCL profiling events:
https://codeplay.com/portal/08-27-19-optimizing-your-sycl-code-using-profiling
https://gist.github.com/GeorgeWeb/ff908516bfe57f107bc36822dbdfe145
void profile(event_list& eventList, const time_point_list& startTimeList) {
  if (startTimeList.size() != eventList.size()) {
    std::string errMsg =
        "The number of events do not match the number of starting time "
        "points.";
    throw std::runtime_error("Profiling Error:\n" + errMsg);
  }
  T cgSubmissionTime = 0;
  T kernExecutionTime = 0;
  T realExecutionTime = 0;
  const auto eventCount = eventList.size();
  for (size_t i = 0; i < eventCount; ++i) {
    auto curEvent = eventList.at(i);
    curEvent.wait();
    auto curStartTime = startTimeList.at(i);
    const auto end = wall_clock_t::now();
    time_interval_t<T, std::milli> curRealExecutionTime = end - curStartTime;
    realExecutionTime += curRealExecutionTime.count();
    const auto cgSubmissionTimePoint = curEvent.template get_profiling_info<
        cl::sycl::info::event_profiling::command_submit>();
    const auto startKernExecutionTimePoint =
        curEvent.template get_profiling_info<
            cl::sycl::info::event_profiling::command_start>();
    const auto endKernExecutionTimePoint =
        curEvent.template get_profiling_info<
            cl::sycl::info::event_profiling::command_end>();
    cgSubmissionTime +=
        to_milli(startKernExecutionTimePoint - cgSubmissionTimePoint);
    kernExecutionTime +=
        to_milli(endKernExecutionTimePoint - startKernExecutionTimePoint);
  }
  set_command_group_submission_time(cgSubmissionTime);
  set_kernel_execution_time(kernExecutionTime);
  set_real_execution_time(realExecutionTime);
}

I then applied the same steps to my vector addition code:
void VectorAddInDPCPP_2(const IntArray &VA, const IntArray &VB, IntArray &VC, queue deviceQueue) {
  // Print out the device information used for the kernel code
  std::cout << "Device: " << deviceQueue.get_device().get_info<info::device::name>() << std::endl;

  // Set up the profiling data containers
  using wall_clock_t = std::chrono::high_resolution_clock;
  using time_point_t = std::chrono::time_point<wall_clock_t>;
  const size_t profiling_iters = 1;
  std::vector<cl::sycl::event> eventList(profiling_iters);
  std::vector<time_point_t> startTimeList(profiling_iters);

  // Create the range object for the arrays managed by the buffers
  range<1> num_items{ array_size };
  buffer<int, 1> bufferA(VA.data(), num_items);
  buffer<int, 1> bufferB(VB.data(), num_items);
  buffer<int, 1> bufferC(VC.data(), num_items);

  // Submit a kernel to the queue; submit returns a SYCL event
  for (size_t i = 0; i < profiling_iters; ++i) {
    startTimeList.at(i) = wall_clock_t::now();
    eventList.at(i) = deviceQueue.submit([&](handler &cgh) {
      auto accessorA = bufferA.get_access<dp_read>(cgh);
      auto accessorB = bufferB.get_access<dp_read>(cgh);
      auto accessorC = bufferC.get_access<dp_write>(cgh);
      cgh.parallel_for(num_items, [=](id<1> j) {
        accessorC[j] = accessorA[j] + accessorB[j];
      });
    });
  }

  // Collect and print the profiling results
  example_profiler<double> my_profiler(eventList, startTimeList);
  std::cout << "Kernel exec: " << my_profiler.get_kernel_execution_time() << " msec" << std::endl;
  std::cout << "Cmd Group submission: " << my_profiler.get_command_group_submission_time() << " msec" << std::endl;
  std::cout << "Real exec: " << my_profiler.get_real_execution_time() << " msec" << std::endl;
}

Finally, this is the result for a 10K array on the DevCloud:
########################################################################
# Date: Wed Apr 1 08:26:52 PDT 2020
# Job ID: 561887.v-qsvr-1.aidevcloud
# User: u38134
# Resources: neednodes=1:gpu:ppn=2,nodes=1:gpu:ppn=2,walltime=06:00:00
########################################################################
:: setvars has already been run. Skipping any further invocation. To force its re-execution, pass --force
./vector-add
Device: Intel(R) Gen9 HD Graphics NEO
Kernel exec: 0.150332 msec
Cmd Group submission: 2.66203 msec
Real exec: 167.013 msec
VectorAddInDPCPP exec: 232.501 msec
Scalar exec: 0.095 msec
success
########################################################################
# End of output for job 561887.v-qsvr-1.aidevcloud
# Date: Wed Apr 1 08:26:57 PDT 2020
########################################################################

As you can see above, the queue submission + kernel execution time is much less than the real execution time. What is the reason for waiting that long, and is there a way to reduce or eliminate this waiting time?
Melih