DevCloud: OneAPI vector-add example parallel execution is much slower than the scalar one

Question

Hello,I tried the BaseKit-code-samples/DPC++Compiler/vector-add example as suggested in the DeveloperZone get started page:https://devcloud.intel.com/oneapi/get-started/base-toolkit/I measured the execution time of add_arrays_scalar() and add_arrays_parallel() functions with my RTM_START() and RTM_STOP() macros.double getCPUFreq() {
#define BUFLEN 110
 
	FILE* sysinfo;
	char* ptr;
	char buf[BUFLEN];
	char key[] = "cpu MHz";
	int keylen = sizeof(key) - 1;
	double freq = -1;
 
	sysinfo = fopen("/proc/cpuinfo", "r");
	if (sysinfo != NULL) {
		while (fgets(buf, BUFLEN, sysinfo) != NULL) {
			if (!strncmp(buf, key, keylen)) {
				ptr = strstr(buf, ":");
				freq = atof(ptr + 1) * 1000000;
				break;
			}
		}
		fclose(sysinfo);
	}
	fprintf(stderr, "Freq = %f GHz
", freq / 1000000000);
	return freq;
}
 
#define RTM_START()	start = (double)_rdtsc()
 
#define RTM_STOP() 	stop = (double)_rdtsc(); \
			secs = ((double)(stop - start)) / (double)getCPUFreq();and my main function, where I took the measurements:int main() {
	
	double start = 0.0, stop = 0.0, secs = 0.0;
 
	IntArray addend_1, addend_2, sum_scalar, sum_parallel;
 
	// Initialize arrays with values from 0 to array_size-1
	initialize_array(addend_1);
	initialize_array(addend_2);
	initialize_array(sum_scalar);
	initialize_array(sum_parallel);
 
	printf("CPU Freq = %10.2lf
", (double)getCPUFreq());
 
	start = (double)_rdtsc();
	// Add arrays in scalar and in parallel
	add_arrays_scalar(sum_scalar, addend_1, addend_2);
	stop = (double)_rdtsc();
	secs = ((double)(stop - start)) / (double)getCPUFreq();
	printf("Scalar execution time = %2.6lf seconds
", secs);
 
  	add_arrays_parallel(sum_parallel, addend_1, addend_2);
 
	// Verify that the two sum arrays are equal
	for (size_t i = 0; i &lt; sum_parallel.size(); i++) {
		if (sum_parallel[i] != sum_scalar[i]) {
			std::cout &lt;&lt; "fail" &lt;&lt; std::endl;
			return -1;
		}
	}
	std::cout &lt;&lt; "success" &lt;&lt; std::endl;
 
	std::cout &lt;&lt; "MGUNAL TEST 5 DONE" &lt;&lt; std::endl;
	return 0;
}And finally my results:########################################################################
#      Date:           Wed Mar 11 06:14:01 PDT 2020
#    Job ID:           542630.v-qsvr-1.aidevcloud
#      User:           u38134
# Resources:           neednodes=1:gpu:ppn=2,nodes=1:gpu:ppn=2,walltime=06:00:00
########################################################################
 
:: setvars has already been run. Skipping any further invocation.  To force its re-execution, pass --force
./vector-add
CPU Freq = 954182000.00
Scalar execution time = 0.000085 seconds
Device: Intel(R) Gen9 HD Graphics NEO
Parallel execution time = 3.516770 seconds
success
MGUNAL TEST 5 DONE
 
########################################################################
# End of output for job 542630.v-qsvr-1.aidevcloud
# Date: Wed Mar 11 06:14:03 PDT 2020
########################################################################As seen above, parallel execution time is much more greater than the scaler execution time. So, I decided to use a SYCL host device instead of an accelerator./*
	// FPGA device selector:  Emulator or Hardware
#ifdef FPGA_EMULATOR
	intel::fpga_emulator_selector device_selector;
#elif defined(FPGA)
	intel::fpga_selector device_selector;
#else
 	// Initializing the devices queue with the default selector
  	// The device queue is used to enqueue the kernels and encapsulates
  	// all the states needed for execution
	default_selector device_selector;
#endif
*/
 
	host_selector device_selector;Parallel execution is still significantly slower than the scaler one..########################################################################
#      Date:           Thu Mar 12 00:34:12 PDT 2020
#    Job ID:           542971.v-qsvr-1.aidevcloud
#      User:           u38134
# Resources:           neednodes=1:gpu:ppn=2,nodes=1:gpu:ppn=2,walltime=06:00:00
########################################################################
 
:: setvars has already been run. Skipping any further invocation.  To force its re-execution, pass --force
./vector-add
CPU Freq = 1000065000.00
Scalar execution time = 0.000077 seconds
Device: SYCL host device
Parallel execution time = 0.318517 seconds
success
 
########################################################################
# End of output for job 542971.v-qsvr-1.aidevcloud
# Date: Thu Mar 12 00:34:14 PDT 2020
########################################################################I have no idea why the parallel execution takes so long? Am I doing something wrong?

gnl · Answer

Instead of measuring the add_arrays_parallel() function as a whole, I separated it to chunks as shown below:void add_arrays_parallel(IntArray &amp;sum, const IntArray &amp;addend_1, const IntArray &amp;addend_2) {
	double start = 0.0, stop = 0.0, secs = 0.0;
 
	RTM_START();
	std::unique_ptr&lt;queue&gt; q = initialize_device_queue();
	RTM_STOP();
	printf("Init device queue = %2.6lf seconds
", secs);
 
	RTM_START();
	// The range of the arrays managed by the buffer
	range&lt;1&gt; num_items{ array_size };
 
	// Buffers are used to tell DPC++ which data will be shared between the host
	// and the devices because they usually don't share physical memory
	// The pointer that's being passed as the first parameter transfers ownership
	// of the data to DPC++ at runtime. The destructor is called when the buffer
	// goes out of scope and the data is given back to the std::arrays.
	// The second parameter specifies the range given to the buffer.
	buffer&lt;cl_int, 1&gt; addend_1_buf(addend_1.data(), num_items);
	buffer&lt;cl_int, 1&gt; addend_2_buf(addend_2.data(), num_items);
	buffer&lt;cl_int, 1&gt; sum_buf(sum.data(), num_items);
	RTM_STOP();
	printf("Buffer creation = %2.6lf seconds
", secs);
 
	RTM_START();
	// queue::submit takes in a lambda that is passed in a command group handler
	// constructed at runtime. The lambda also contains a command group, which
	// contains the device-side operation and its dependencies
	q-&gt;submit([&amp;](handler &amp;h) {
		// Accessors are the only way to get access to the memory owned
		// by the buffers initialized above. The first get_access template parameter
		// specifies the access mode for the memory and the second template
		// parameter is the type of memory to access the data from; this parameter
		// has a default value
		auto addend_1_accessor = addend_1_buf.template get_access&lt;dp_read&gt;(h);
		auto addend_2_accessor = addend_2_buf.template get_access&lt;dp_read&gt;(h);
 
		// Note: Can use access::mode::discard_write instead of access::mode::write
		// because we're replacing the contents of the entire buffer.
		auto sum_accessor = sum_buf.template get_access&lt;dp_write&gt;(h);
 
		// Use parallel_for to run array addition in parallel. This executes the
		// kernel. The first parameter is the number of work items to use and the
		// second is the kernel, a lambda that specifies what to do per work item.
		// The template parameter ArrayAdd is used to name the kernel at runtime.
		// The parameter passed to the lambda is the work item id of the current
		// item.
		//
		// To remove the requirement to specify the kernel name you can enable
		// unnamed lamdba kernels with the option:
		//     dpcpp -fsycl-unnamed-lambda
		h.parallel_for&lt;class ArrayAdd&gt;(num_items, [=](id&lt;1&gt; i) {
			sum_accessor[i] = addend_1_accessor[i] + addend_2_accessor[i];
		});
	});
	RTM_STOP();
	printf("Queue submission + Accessors + parallel execution time = %2.6lf seconds
", secs);
 
	RTM_START();
	// call wait_and_throw to catch async exception
	q-&gt;wait_and_throw();
	RTM_STOP();
	printf("Q wait_and_throw() execution time = %2.6lf seconds
", secs);
 
	// DPC++ will enqueue and run the kernel. Recall that the buffer's data is
	// given back to the host at the end of the method's scope.
}When I build &amp; run this for a SYCL host device, my results are:########################################################################
#      Date:           Thu Mar 12 03:55:56 PDT 2020
#    Job ID:           543019.v-qsvr-1.aidevcloud
#      User:           u38134
# Resources:           neednodes=1:gpu:ppn=2,nodes=1:gpu:ppn=2,walltime=06:00:00
########################################################################
 
:: setvars has already been run. Skipping any further invocation.  To force its re-execution, pass --force
./vector-add
CPU Freq = 939189000.00
Scalar execution time = 0.000086 seconds
Init device queue = 0.253235 seconds
Buffer creation = 0.000001 seconds
Queue submission + Accessors + parallel execution time = 0.000085 seconds
Q wait_and_throw() execution time = 0.000002 seconds
success
 
########################################################################
# End of output for job 543019.v-qsvr-1.aidevcloud
# Date: Thu Mar 12 03:55:57 PDT 2020
########################################################################Then I found out that the  initialize_device_queue() function is the slowest one. And the Queue submission + Accessors + parallel execution time is almost the same with the Scalar execution time...Then I build &amp; run this for a GPU device, and my results are:########################################################################
#      Date:           Thu Mar 12 04:00:45 PDT 2020
#    Job ID:           543025.v-qsvr-1.aidevcloud
#      User:           u38134
# Resources:           neednodes=1:gpu:ppn=2,nodes=1:gpu:ppn=2,walltime=06:00:00
########################################################################
 
:: setvars has already been run. Skipping any further invocation.  To force its re-execution, pass --force
./vector-add
CPU Freq = 1000062000.00
Scalar execution time = 0.000077 seconds
Device: Intel(R) Gen9 HD Graphics NEO
Init device queue = 0.422673 seconds
Buffer creation = 0.000004 seconds
Queue submission + Accessors + parallel execution time = 1.751712 seconds
Q wait_and_throw() execution time = 0.001086 seconds
success
 
########################################################################
# End of output for job 543025.v-qsvr-1.aidevcloud
# Date: Thu Mar 12 04:00:47 PDT 2020
########################################################################and the Queue submission + Accessors + parallel execution time is 1.75 secondsWhat takes so long?

anilerinch_a_intel · Answer

Hi,

There is some overhead when using an accelerator, so you have make the array size much larger to see the advantage of parallel operation.

In the file vector_add.cpp: Change the following, rebuild and rerun the example.

static const size_t ARRAY_SIZE = 10000;

to

static const size_t ARRAY_SIZE = 10000000;

and let us know the results

Regards

Anil

gnl · Answer

Hi,When I try to increase the array size as suggested, I get a seg-fault on the cloud and a stack overflow exception, on my computer using intel oneapi base toolkit (Windows) (FPGA Emulator)// Problem size for this example
//constexpr size_t array_size = 10000;
constexpr size_t array_size = 100000000;I could not find a way to increase the stack size and then decided to use a vector instead of an array.// Define the ARRAY type for use in this example
// typedef std::array&lt;cl::sycl::cl_int, array_size&gt; IntArray;
typedef std::vector&lt;cl::sycl::cl_int&gt; IntArray;and, create vectors using the v(size, val) constructor.IntArray addend_1(array_size, 0), addend_2(array_size, 0), sum_scalar(array_size, 0), sum_parallel(array_size, 0);I tried for the same size, and checked my result vector is filled as expected. Then, I increased the array_size to 100M, and this is the result:########################################################################
#      Date:           Mon Mar 30 06:10:14 PDT 2020
#    Job ID:           560373.v-qsvr-1.aidevcloud
#      User:           u38134
# Resources:           neednodes=1:gpu:ppn=2,nodes=1:gpu:ppn=2,walltime=06:00:00
########################################################################
 
:: setvars has already been run. Skipping any further invocation.  To force its re-execution, pass --force
./vector-add
Vector Size = 100000000
CPU Freq (GHz)= 4.000
Scalar execution time = 0.063809 seconds
Device: Intel(R) Gen9 HD Graphics NEO
Parallel execution time = 1.711407 seconds
success
 
########################################################################
# End of output for job 560373.v-qsvr-1.aidevcloud
# Date: Mon Mar 30 06:10:20 PDT 2020
########################################################################The code blocks I measured here:	RTM_START();
	// Add arrays in scalar and in parallel
	add_arrays_scalar(sum_scalar, addend_1, addend_2);
	RTM_STOP();
	printf("Scalar execution time = %2.6lf seconds
", secs);
	
	RTM_START();
  	add_arrays_parallel(sum_parallel, addend_1, addend_2);
	RTM_STOP();
	printf("Parallel execution time = %2.6lf seconds
", secs);Finally, I measured it again partially and this is the result:########################################################################
#      Date:           Mon Mar 30 06:45:23 PDT 2020
#    Job ID:           560414.v-qsvr-1.aidevcloud
#      User:           u38134
# Resources:           neednodes=1:gpu:ppn=2,nodes=1:gpu:ppn=2,walltime=06:00:00
########################################################################
 
:: setvars has already been run. Skipping any further invocation.  To force its re-execution, pass --force
./vector-add
Vector Size = 100000000
CPU Freq (GHz)= 4.000
Scalar execution time = 0.063832 seconds
Device: Intel(R) Gen9 HD Graphics NEO
Init device queue = 1.273701 seconds
Buffer creation = 0.000002 seconds
Queue submission + Accessors + parallel execution time = 0.504512 seconds
Q wait_and_throw() execution time = 0.063734 seconds
Parallel execution time = 3.893053 seconds
success
 
########################################################################
# End of output for job 560414.v-qsvr-1.aidevcloud
# Date: Mon Mar 30 06:45:29 PDT 2020
########################################################################The  blocks I measured above, inside the add_arrays_parallel() function:        RTM_START();
	std::unique_ptr&lt;queue&gt; q = initialize_device_queue();
	RTM_STOP();
	printf("Init device queue = %2.6lf seconds
", secs);
 
	RTM_START();
	// The range of the arrays managed by the buffer
	range&lt;1&gt; num_items{ array_size };
 
	// Buffers are used to tell DPC++ which data will be shared between the host
	// and the devices because they usually don't share physical memory
	// The pointer that's being passed as the first parameter transfers ownership
	// of the data to DPC++ at runtime. The destructor is called when the buffer
	// goes out of scope and the data is given back to the std::arrays.
	// The second parameter specifies the range given to the buffer.
	buffer&lt;cl_int, 1&gt; addend_1_buf(addend_1.data(), num_items);
	buffer&lt;cl_int, 1&gt; addend_2_buf(addend_2.data(), num_items);
	buffer&lt;cl_int, 1&gt; sum_buf(sum.data(), num_items);
	RTM_STOP();
	printf("Buffer creation = %2.6lf seconds
", secs);
 
	RTM_START();
	// queue::submit takes in a lambda that is passed in a command group handler
	// constructed at runtime. The lambda also contains a command group, which
	// contains the device-side operation and its dependencies
	q-&gt;submit([&amp;](handler &amp;h) {
		// Accessors are the only way to get access to the memory owned
		// by the buffers initialized above. The first get_access template parameter
		// specifies the access mode for the memory and the second template
		// parameter is the type of memory to access the data from; this parameter
		// has a default value
		auto addend_1_accessor = addend_1_buf.template get_access&lt;dp_read&gt;(h);
		auto addend_2_accessor = addend_2_buf.template get_access&lt;dp_read&gt;(h);
 
		// Note: Can use access::mode::discard_write instead of access::mode::write
		// because we're replacing the contents of the entire buffer.
		auto sum_accessor = sum_buf.template get_access&lt;dp_write&gt;(h);
 
		// Use parallel_for to run array addition in parallel. This executes the
		// kernel. The first parameter is the number of work items to use and the
		// second is the kernel, a lambda that specifies what to do per work item.
		// The template parameter ArrayAdd is used to name the kernel at runtime.
		// The parameter passed to the lambda is the work item id of the current
		// item.
		//
		// To remove the requirement to specify the kernel name you can enable
		// unnamed lamdba kernels with the option:
		//     dpcpp -fsycl-unnamed-lambda
		h.parallel_for&lt;class ArrayAdd&gt;(num_items, [=](id&lt;1&gt; i) {
			sum_accessor[i] = addend_1_accessor[i] + addend_2_accessor[i];
		});
	});
	RTM_STOP();
	printf("Queue submission + Accessors + parallel execution time = %2.6lf seconds
", secs);
 
	RTM_START();
	// call wait_and_throw to catch async exception
	q-&gt;wait_and_throw();
	RTM_STOP();
	printf("Q wait_and_throw() execution time = %2.6lf seconds
", secs);There is still a big difference between Scalar execution time = 0.063832 seconds and Queue submission + Accessors + parallel execution time = 0.504512 seconds blocks.Parallel exec seems 10 times slower than the scalar one. Did I miss something here?

gnl · Answer

In addition to this,I found this about the sycl profiling events, https://codeplay.com/portal/08-27-19-optimizing-your-sycl-code-using-profilinghttps://gist.github.com/GeorgeWeb/ff908516bfe57f107bc36822dbdfe145void profile(event_list&amp; eventList, const time_point_list&amp; startTimeList) {
    if (startTimeList.size() != eventList.size()) {
      std::string errMsg =
          "The number of events do not match the number of starting time "
          "points.";
      throw std::runtime_error("Profiling Error:
" + errMsg);
    }
 
    T cgSubmissionTime = 0;
    T kernExecutionTime = 0;
    T realExecutionTime = 0;
    
    const auto eventCount = eventList.size();
    for (size_t i = 0; i &lt; eventCount; ++i) {
      auto curEvent = eventList.at(i);
      curEvent.wait();
      auto curStartTime = startTimeList.at(i);
 
      const auto end = wall_clock_t::now();
      time_interval_t&lt;T, std::milli&gt; curRealExecutionTime = end - curStartTime;
      realExecutionTime += curRealExecutionTime.count();
 
      const auto cgSubmissionTimePoint = curEvent.template get_profiling_info&lt;
          cl::sycl::info::event_profiling::command_submit&gt;();
      const auto startKernExecutionTimePoint =
          curEvent.template get_profiling_info&lt;
              cl::sycl::info::event_profiling::command_start&gt;();
      const auto endKernExecutionTimePoint =
          curEvent.template get_profiling_info&lt;
              cl::sycl::info::event_profiling::command_end&gt;();
 
      cgSubmissionTime +=
          to_milli(startKernExecutionTimePoint - cgSubmissionTimePoint);
      kernExecutionTime +=
          to_milli(endKernExecutionTimePoint - startKernExecutionTimePoint);
    }
 
    set_command_group_submission_time(cgSubmissionTime);
    set_kernel_execution_time(kernExecutionTime);
    set_real_execution_time(realExecutionTime);
  }And applied the steps to my vector addition code:void VectorAddInDPCPP_2(const IntArray &amp;VA, const IntArray &amp;VB, IntArray &amp;VC, queue deviceQueue) {
 
	// print out the device information used for the kernel code
	std::cout &lt;&lt; "Device: " &lt;&lt; deviceQueue.get_device().get_info&lt;info::device::name&gt;() &lt;&lt; std::endl;
 
	// set up profiling data containers
	using wall_clock_t = std::chrono::high_resolution_clock;
	using time_point_t = std::chrono::time_point&lt;wall_clock_t&gt;;
 
	int profiling_iters = 1;
 
	std::vector&lt;cl::sycl::event&gt; eventList(profiling_iters);
	std::vector&lt;time_point_t&gt; startTimeList(profiling_iters);
 
	// create the range object for the arrays managed by the buffer
	range&lt;1&gt; num_items{ array_size };
 
	buffer&lt;int, 1&gt; bufferA(VA.data(), num_items);
	buffer&lt;int, 1&gt; bufferB(VB.data(), num_items);
	buffer&lt;int, 1&gt; bufferC(VC.data(), num_items);
 
	// Submit a kernel to the queue, returns a SYCL event
	for (size_t i = 0; i &lt; profiling_iters; ++i) {
		startTimeList.at(i) = wall_clock_t::now();
		eventList.at(i) = deviceQueue.submit([&amp;](handler &amp;cgh) {
			auto accessorA = bufferA.get_access&lt;dp_read&gt;(cgh);
			auto accessorB = bufferB.get_access&lt;dp_read&gt;(cgh);
			auto accessorC = bufferC.get_access&lt;dp_write&gt;(cgh);
 
			cgh.parallel_for(num_items, [=](id&lt;1&gt; j) {
				accessorC[j] = accessorA[j] + accessorB[j];
			});
		});
	}
 
	// exec profile
	example_profiler&lt;double&gt; my_profiler(eventList, startTimeList);
	std::cout &lt;&lt; "Kernel exec:              " &lt;&lt; my_profiler.get_kernel_execution_time() &lt;&lt; " msec" &lt;&lt; std::endl;
	std::cout &lt;&lt; "Cmd Group submission:     " &lt;&lt; my_profiler.get_command_group_submission_time() &lt;&lt; " msec" &lt;&lt; std::endl;
	std::cout &lt;&lt; "Real exec:                " &lt;&lt; my_profiler.get_real_execution_time() &lt;&lt; " msec" &lt;&lt; std::endl;
}Finally, this is the result for a 10K array, on the cloud:########################################################################
#      Date:           Wed Apr  1 08:26:52 PDT 2020
#    Job ID:           561887.v-qsvr-1.aidevcloud
#      User:           u38134
# Resources:           neednodes=1:gpu:ppn=2,nodes=1:gpu:ppn=2,walltime=06:00:00
########################################################################
 
:: setvars has already been run. Skipping any further invocation.  To force its re-execution, pass --force
./vector-add
Device: Intel(R) Gen9 HD Graphics NEO
Kernel exec:              0.150332 msec
Cmd Group submission:     2.66203 msec
Real exec:                167.013 msec
VectorAddInDPCPP exec:    232.501 msec
Scalar exec:              0.095 msec
success
 
########################################################################
# End of output for job 561887.v-qsvr-1.aidevcloud
# Date: Wed Apr  1 08:26:57 PDT 2020
########################################################################As you can see above, the queue submission + kernel execution time is much less than the real execution time. What is the reason for waiting that long and is there a way to improve/get rid of this waiting time?Melih

anilerinch_a_intel · Answer

Hi

Hope things are going fine there.

For processing the loop efficiently in the FPGA in parallel fashion , loop unrolling can be applied. This ensures that different execution units are created to handle the summation. Without using this the parallel and scalar operation performance will not differ much , and parallel will also involve setup and related time.

So to extract performance out of the FPGA please perform a loop unrolling

#pragma unroll UNROLL_FACTOR

before the for loop exection.

You can try with different UNROLL_FACTORS , as twos multiples like 2 , 4 ,8 ,16 etc

and then compare the performance with the scalar version.

Please find an example below

cgh.single_task<class covariance>(

[=]()

{

/* Accessor related code HERE */

#pragma unroll UNROLL_FACTOR

For (int j=0;j<num_items, j++)

{

accessorC[j] = accessorA[j] + accessorB[j];

}

Thanks and Regards

Anil

Forum Discussion

DevCloud: OneAPI vector-add example parallel execution is much slower than the scalar one

8 Replies

Recent Discussions

Agilex 7 FPGA Starter Kit with oneAPI Toolkit flow not detected over PCIe

MCTP over PCIe VDM routing to PMCI in OFS N6000 FIM configuration and datapath clarification

HLS Compiler 24.1 error - aocl-clang.exe - dll entry point not found

Error faced while executing on Agilex FPGA board....

AI Suite System Throughput Issue