Forum Discussion
GNL
New Contributor
5 years ago

Instead of measuring the add_arrays_parallel() function as a whole, I split it into chunks as shown below:
void add_arrays_parallel(IntArray &sum, const IntArray &addend_1, const IntArray &addend_2) {
    double start = 0.0, stop = 0.0, secs = 0.0;

    RTM_START();
    std::unique_ptr<queue> q = initialize_device_queue();
    RTM_STOP();
    printf("Init device queue = %2.6lf seconds\n", secs);

    RTM_START();
    // The range of the arrays managed by the buffers
    range<1> num_items{ array_size };
    // Buffers tell DPC++ which data will be shared between the host and the
    // devices, because they usually don't share physical memory.
    // The pointer passed as the first parameter transfers ownership of the
    // data to DPC++ at runtime. When the buffer goes out of scope, its
    // destructor runs and the data is given back to the std::arrays.
    // The second parameter specifies the range of the buffer.
    buffer<cl_int, 1> addend_1_buf(addend_1.data(), num_items);
    buffer<cl_int, 1> addend_2_buf(addend_2.data(), num_items);
    buffer<cl_int, 1> sum_buf(sum.data(), num_items);
    RTM_STOP();
    printf("Buffer creation = %2.6lf seconds\n", secs);

    RTM_START();
    // queue::submit takes a lambda that is passed a command group handler
    // constructed at runtime. The lambda defines a command group, which
    // contains the device-side operation and its dependencies.
    q->submit([&](handler &h) {
        // Accessors are the only way to get at the memory owned by the
        // buffers created above. The first get_access template parameter
        // specifies the access mode; the second (which has a default value)
        // is the type of memory to access the data from.
        auto addend_1_accessor = addend_1_buf.template get_access<dp_read>(h);
        auto addend_2_accessor = addend_2_buf.template get_access<dp_read>(h);
        // Note: access::mode::discard_write can be used instead of
        // access::mode::write because we replace the entire buffer's contents.
        auto sum_accessor = sum_buf.template get_access<dp_write>(h);
        // parallel_for runs the array addition in parallel: it executes the
        // kernel. The first argument is the number of work items and the
        // second is the kernel, a lambda specifying what to do per work item.
        // The template parameter ArrayAdd names the kernel at runtime; the
        // lambda's parameter is the id of the current work item.
        //
        // To remove the requirement to name the kernel, enable unnamed
        // lambda kernels with the option: dpcpp -fsycl-unnamed-lambda
        h.parallel_for<class ArrayAdd>(num_items, [=](id<1> i) {
            sum_accessor[i] = addend_1_accessor[i] + addend_2_accessor[i];
        });
    });
    RTM_STOP();
    printf("Queue submission + Accessors + parallel execution time = %2.6lf seconds\n", secs);

    RTM_START();
    // Call wait_and_throw to catch async exceptions
    q->wait_and_throw();
    RTM_STOP();
    printf("Q wait_and_throw() execution time = %2.6lf seconds\n", secs);
    // DPC++ will enqueue and run the kernel. Recall that the buffers' data is
    // given back to the host at the end of the function's scope.
}

When I build & run this for a SYCL host device, my results are:
########################################################################
# Date: Thu Mar 12 03:55:56 PDT 2020
# Job ID: 543019.v-qsvr-1.aidevcloud
# User: u38134
# Resources: neednodes=1:gpu:ppn=2,nodes=1:gpu:ppn=2,walltime=06:00:00
########################################################################
:: setvars has already been run. Skipping any further invocation. To force its re-execution, pass --force
./vector-add
CPU Freq = 939189000.00
Scalar execution time = 0.000086 seconds
Init device queue = 0.253235 seconds
Buffer creation = 0.000001 seconds
Queue submission + Accessors + parallel execution time = 0.000085 seconds
Q wait_and_throw() execution time = 0.000002 seconds
success
########################################################################
# End of output for job 543019.v-qsvr-1.aidevcloud
# Date: Thu Mar 12 03:55:57 PDT 2020
########################################################################

So I found out that the initialize_device_queue() function is the slowest one, and the "Queue submission + Accessors + parallel execution time" is almost the same as the scalar execution time.
Then I build & run this for a GPU device, and my results are:
########################################################################
# Date: Thu Mar 12 04:00:45 PDT 2020
# Job ID: 543025.v-qsvr-1.aidevcloud
# User: u38134
# Resources: neednodes=1:gpu:ppn=2,nodes=1:gpu:ppn=2,walltime=06:00:00
########################################################################
:: setvars has already been run. Skipping any further invocation. To force its re-execution, pass --force
./vector-add
CPU Freq = 1000062000.00
Scalar execution time = 0.000077 seconds
Device: Intel(R) Gen9 HD Graphics NEO
Init device queue = 0.422673 seconds
Buffer creation = 0.000004 seconds
Queue submission + Accessors + parallel execution time = 1.751712 seconds
Q wait_and_throw() execution time = 0.001086 seconds
success
########################################################################
# End of output for job 543025.v-qsvr-1.aidevcloud
# Date: Thu Mar 12 04:00:47 PDT 2020
########################################################################

and here the "Queue submission + Accessors + parallel execution time" is 1.75 seconds.
What takes so long?