Forum Discussion
Hi,
There is some overhead when using an accelerator, so you have to make the array size much larger to see the advantage of parallel operation.
In the file vector_add.cpp: Change the following, rebuild and rerun the example.
static const size_t ARRAY_SIZE = 10000;
to
static const size_t ARRAY_SIZE = 10000000;
and let us know the results
Regards
Anil
Hi,
When I try to increase the array size as suggested, I get a segfault on the DevCloud, and a stack-overflow exception on my computer with the Intel oneAPI Base Toolkit (Windows, FPGA Emulator).
// Problem size for this example
//constexpr size_t array_size = 10000;
constexpr size_t array_size = 100000000;
I could not find a way to increase the stack size, so I decided to use a vector instead of an array.
// Define the ARRAY type for use in this example
// typedef std::array<cl::sycl::cl_int, array_size> IntArray;
typedef std::vector<cl::sycl::cl_int> IntArray;
and create the vectors using the v(size, val) constructor:
IntArray addend_1(array_size, 0), addend_2(array_size, 0), sum_scalar(array_size, 0), sum_parallel(array_size, 0);
I tried the same size first and checked that my result vector is filled as expected. Then I increased array_size to 100M, and this is the result:
########################################################################
# Date: Mon Mar 30 06:10:14 PDT 2020
# Job ID: 560373.v-qsvr-1.aidevcloud
# User: u38134
# Resources: neednodes=1:gpu:ppn=2,nodes=1:gpu:ppn=2,walltime=06:00:00
########################################################################
:: setvars has already been run. Skipping any further invocation. To force its re-execution, pass --force
./vector-add
Vector Size = 100000000
CPU Freq (GHz)= 4.000
Scalar execution time = 0.063809 seconds
Device: Intel(R) Gen9 HD Graphics NEO
Parallel execution time = 1.711407 seconds
success
########################################################################
# End of output for job 560373.v-qsvr-1.aidevcloud
# Date: Mon Mar 30 06:10:20 PDT 2020
########################################################################
The code blocks I measured here:
RTM_START();
// Add arrays in scalar and in parallel
add_arrays_scalar(sum_scalar, addend_1, addend_2);
RTM_STOP();
printf("Scalar execution time = %2.6lf seconds\n", secs);
RTM_START();
add_arrays_parallel(sum_parallel, addend_1, addend_2);
RTM_STOP();
printf("Parallel execution time = %2.6lf seconds\n", secs);
Finally, I measured it again in parts, and this is the result:
########################################################################
# Date: Mon Mar 30 06:45:23 PDT 2020
# Job ID: 560414.v-qsvr-1.aidevcloud
# User: u38134
# Resources: neednodes=1:gpu:ppn=2,nodes=1:gpu:ppn=2,walltime=06:00:00
########################################################################
:: setvars has already been run. Skipping any further invocation. To force its re-execution, pass --force
./vector-add
Vector Size = 100000000
CPU Freq (GHz)= 4.000
Scalar execution time = 0.063832 seconds
Device: Intel(R) Gen9 HD Graphics NEO
Init device queue = 1.273701 seconds
Buffer creation = 0.000002 seconds
Queue submission + Accessors + parallel execution time = 0.504512 seconds
Q wait_and_throw() execution time = 0.063734 seconds
Parallel execution time = 3.893053 seconds
success
########################################################################
# End of output for job 560414.v-qsvr-1.aidevcloud
# Date: Mon Mar 30 06:45:29 PDT 2020
########################################################################
The blocks I measured above, inside the add_arrays_parallel() function:
RTM_START();
std::unique_ptr<queue> q = initialize_device_queue();
RTM_STOP();
printf("Init device queue = %2.6lf seconds\n", secs);
RTM_START();
// The range of the arrays managed by the buffer
range<1> num_items{ array_size };
// Buffers are used to tell DPC++ which data will be shared between the host
// and the devices because they usually don't share physical memory
// The pointer that's being passed as the first parameter transfers ownership
// of the data to DPC++ at runtime. The destructor is called when the buffer
// goes out of scope and the data is given back to the std::arrays.
// The second parameter specifies the range given to the buffer.
buffer<cl_int, 1> addend_1_buf(addend_1.data(), num_items);
buffer<cl_int, 1> addend_2_buf(addend_2.data(), num_items);
buffer<cl_int, 1> sum_buf(sum.data(), num_items);
RTM_STOP();
printf("Buffer creation = %2.6lf seconds\n", secs);
RTM_START();
// queue::submit takes in a lambda that is passed in a command group handler
// constructed at runtime. The lambda also contains a command group, which
// contains the device-side operation and its dependencies
q->submit([&](handler &h) {
// Accessors are the only way to get access to the memory owned
// by the buffers initialized above. The first get_access template parameter
// specifies the access mode for the memory and the second template
// parameter is the type of memory to access the data from; this parameter
// has a default value
auto addend_1_accessor = addend_1_buf.template get_access<dp_read>(h);
auto addend_2_accessor = addend_2_buf.template get_access<dp_read>(h);
// Note: Can use access::mode::discard_write instead of access::mode::write
// because we're replacing the contents of the entire buffer.
auto sum_accessor = sum_buf.template get_access<dp_write>(h);
// Use parallel_for to run array addition in parallel. This executes the
// kernel. The first parameter is the number of work items to use and the
// second is the kernel, a lambda that specifies what to do per work item.
// The template parameter ArrayAdd is used to name the kernel at runtime.
// The parameter passed to the lambda is the work item id of the current
// item.
//
// To remove the requirement to specify the kernel name you can enable
// unnamed lambda kernels with the option:
// dpcpp -fsycl-unnamed-lambda
h.parallel_for<class ArrayAdd>(num_items, [=](id<1> i) {
sum_accessor[i] = addend_1_accessor[i] + addend_2_accessor[i];
});
});
RTM_STOP();
printf("Queue submission + Accessors + parallel execution time = %2.6lf seconds\n", secs);
RTM_START();
// call wait_and_throw to catch async exception
q->wait_and_throw();
RTM_STOP();
printf("Q wait_and_throw() execution time = %2.6lf seconds\n", secs);
There is still a big difference between the scalar block (0.063832 seconds) and the queue submission + accessors + parallel execution block (0.504512 seconds).
The parallel block seems nearly 8 times slower than the scalar one (0.504512 s vs 0.063832 s). Did I miss something here?
- GNL
In addition to this, I found these resources about SYCL profiling events:
https://codeplay.com/portal/08-27-19-optimizing-your-sycl-code-using-profiling
https://gist.github.com/GeorgeWeb/ff908516bfe57f107bc36822dbdfe145
void profile(event_list& eventList, const time_point_list& startTimeList) {
  if (startTimeList.size() != eventList.size()) {
    std::string errMsg =
        "The number of events do not match the number of starting time points.";
    throw std::runtime_error("Profiling Error:\n" + errMsg);
  }
  T cgSubmissionTime = 0;
  T kernExecutionTime = 0;
  T realExecutionTime = 0;
  const auto eventCount = eventList.size();
  for (size_t i = 0; i < eventCount; ++i) {
    auto curEvent = eventList.at(i);
    curEvent.wait();
    auto curStartTime = startTimeList.at(i);
    const auto end = wall_clock_t::now();
    time_interval_t<T, std::milli> curRealExecutionTime = end - curStartTime;
    realExecutionTime += curRealExecutionTime.count();
    const auto cgSubmissionTimePoint =
        curEvent.template get_profiling_info<
            cl::sycl::info::event_profiling::command_submit>();
    const auto startKernExecutionTimePoint =
        curEvent.template get_profiling_info<
            cl::sycl::info::event_profiling::command_start>();
    const auto endKernExecutionTimePoint =
        curEvent.template get_profiling_info<
            cl::sycl::info::event_profiling::command_end>();
    cgSubmissionTime +=
        to_milli(startKernExecutionTimePoint - cgSubmissionTimePoint);
    kernExecutionTime +=
        to_milli(endKernExecutionTimePoint - startKernExecutionTimePoint);
  }
  set_command_group_submission_time(cgSubmissionTime);
  set_kernel_execution_time(kernExecutionTime);
  set_real_execution_time(realExecutionTime);
}
Then I applied the same steps to my vector addition code:
void VectorAddInDPCPP_2(const IntArray &VA, const IntArray &VB, IntArray &VC,
                        queue deviceQueue) {
  // print out the device information used for the kernel code
  std::cout << "Device: "
            << deviceQueue.get_device().get_info<info::device::name>()
            << std::endl;
  // set up profiling data containers
  using wall_clock_t = std::chrono::high_resolution_clock;
  using time_point_t = std::chrono::time_point<wall_clock_t>;
  const size_t profiling_iters = 1;
  std::vector<cl::sycl::event> eventList(profiling_iters);
  std::vector<time_point_t> startTimeList(profiling_iters);
  // create the range object for the arrays managed by the buffer
  range<1> num_items{ array_size };
  buffer<int, 1> bufferA(VA.data(), num_items);
  buffer<int, 1> bufferB(VB.data(), num_items);
  buffer<int, 1> bufferC(VC.data(), num_items);
  // Submit a kernel to the queue; each submit returns a SYCL event
  for (size_t i = 0; i < profiling_iters; ++i) {
    startTimeList.at(i) = wall_clock_t::now();
    eventList.at(i) = deviceQueue.submit([&](handler &cgh) {
      auto accessorA = bufferA.get_access<dp_read>(cgh);
      auto accessorB = bufferB.get_access<dp_read>(cgh);
      auto accessorC = bufferC.get_access<dp_write>(cgh);
      cgh.parallel_for(num_items, [=](id<1> j) {
        accessorC[j] = accessorA[j] + accessorB[j];
      });
    });
  }
  // exec profile
  example_profiler<double> my_profiler(eventList, startTimeList);
  std::cout << "Kernel exec: " << my_profiler.get_kernel_execution_time()
            << " msec" << std::endl;
  std::cout << "Cmd Group submission: "
            << my_profiler.get_command_group_submission_time() << " msec"
            << std::endl;
  std::cout << "Real exec: " << my_profiler.get_real_execution_time()
            << " msec" << std::endl;
}
Finally, this is the result for a 10K array, on the cloud:
########################################################################
# Date: Wed Apr 1 08:26:52 PDT 2020
# Job ID: 561887.v-qsvr-1.aidevcloud
# User: u38134
# Resources: neednodes=1:gpu:ppn=2,nodes=1:gpu:ppn=2,walltime=06:00:00
########################################################################
:: setvars has already been run. Skipping any further invocation. To force its re-execution, pass --force
./vector-add
Device: Intel(R) Gen9 HD Graphics NEO
Kernel exec: 0.150332 msec
Cmd Group submission: 2.66203 msec
Real exec: 167.013 msec
VectorAddInDPCPP exec: 232.501 msec
Scalar exec: 0.095 msec
success
########################################################################
# End of output for job 561887.v-qsvr-1.aidevcloud
# Date: Wed Apr 1 08:26:57 PDT 2020
########################################################################
As you can see above, the queue submission + kernel execution time is much less than the real execution time. What is the reason for waiting that long, and is there a way to improve or get rid of this waiting time?
Melih