Forum Discussion
Hi,
When I increase the array size as suggested, I get a segmentation fault on the DevCloud and a stack overflow exception on my own computer (Intel oneAPI Base Toolkit on Windows, FPGA emulator):
// Problem size for this example
//constexpr size_t array_size = 10000;
constexpr size_t array_size = 100000000;

I could not find a way to increase the stack size, so I decided to use a vector instead of an array:
// Define the ARRAY type for use in this example
// typedef std::array<cl::sycl::cl_int, array_size> IntArray;
typedef std::vector<cl::sycl::cl_int> IntArray;

and create the vectors using the v(size, val) constructor:
IntArray addend_1(array_size, 0), addend_2(array_size, 0), sum_scalar(array_size, 0), sum_parallel(array_size, 0);

I first tried the original 10K size and verified that the result vector is filled as expected. Then I increased array_size to 100M, and this is the result:
########################################################################
# Date: Mon Mar 30 06:10:14 PDT 2020
# Job ID: 560373.v-qsvr-1.aidevcloud
# User: u38134
# Resources: neednodes=1:gpu:ppn=2,nodes=1:gpu:ppn=2,walltime=06:00:00
########################################################################
:: setvars has already been run. Skipping any further invocation. To force its re-execution, pass --force
./vector-add
Vector Size = 100000000
CPU Freq (GHz)= 4.000
Scalar execution time = 0.063809 seconds
Device: Intel(R) Gen9 HD Graphics NEO
Parallel execution time = 1.711407 seconds
success
########################################################################
# End of output for job 560373.v-qsvr-1.aidevcloud
# Date: Mon Mar 30 06:10:20 PDT 2020
########################################################################

These are the code blocks I measured:
RTM_START();
// Add arrays in scalar
add_arrays_scalar(sum_scalar, addend_1, addend_2);
RTM_STOP();
printf("Scalar execution time = %2.6lf seconds\n", secs);

RTM_START();
// Add arrays in parallel
add_arrays_parallel(sum_parallel, addend_1, addend_2);
RTM_STOP();
printf("Parallel execution time = %2.6lf seconds\n", secs);

Finally, I measured the parallel path again, broken into smaller timed sections, and this is the result:
########################################################################
# Date: Mon Mar 30 06:45:23 PDT 2020
# Job ID: 560414.v-qsvr-1.aidevcloud
# User: u38134
# Resources: neednodes=1:gpu:ppn=2,nodes=1:gpu:ppn=2,walltime=06:00:00
########################################################################
:: setvars has already been run. Skipping any further invocation. To force its re-execution, pass --force
./vector-add
Vector Size = 100000000
CPU Freq (GHz)= 4.000
Scalar execution time = 0.063832 seconds
Device: Intel(R) Gen9 HD Graphics NEO
Init device queue = 1.273701 seconds
Buffer creation = 0.000002 seconds
Queue submission + Accessors + parallel execution time = 0.504512 seconds
Q wait_and_throw() execution time = 0.063734 seconds
Parallel execution time = 3.893053 seconds
success
########################################################################
# End of output for job 560414.v-qsvr-1.aidevcloud
# Date: Mon Mar 30 06:45:29 PDT 2020
########################################################################

The blocks measured above are inside the add_arrays_parallel() function:
RTM_START();
std::unique_ptr<queue> q = initialize_device_queue();
RTM_STOP();
printf("Init device queue = %2.6lf seconds\n", secs);

RTM_START();
// The range of the arrays managed by the buffer
range<1> num_items{ array_size };
// Buffers are used to tell DPC++ which data will be shared between the host
// and the devices, because they usually don't share physical memory.
// The pointer passed as the first parameter transfers ownership of the data
// to DPC++ at runtime. The destructor is called when the buffer goes out of
// scope, and the data is given back to the host containers.
// The second parameter specifies the range given to the buffer.
buffer<cl_int, 1> addend_1_buf(addend_1.data(), num_items);
buffer<cl_int, 1> addend_2_buf(addend_2.data(), num_items);
buffer<cl_int, 1> sum_buf(sum.data(), num_items);
RTM_STOP();
printf("Buffer creation = %2.6lf seconds\n", secs);

RTM_START();
// queue::submit takes a lambda that is passed a command group handler
// constructed at runtime. The lambda contains a command group, which holds
// the device-side operation and its dependencies.
q->submit([&](handler &h) {
  // Accessors are the only way to get access to the memory owned by the
  // buffers initialized above. The first get_access template parameter
  // specifies the access mode for the memory; the second is the type of
  // memory to access the data from and has a default value.
  auto addend_1_accessor = addend_1_buf.template get_access<dp_read>(h);
  auto addend_2_accessor = addend_2_buf.template get_access<dp_read>(h);
  // Note: access::mode::discard_write can be used instead of
  // access::mode::write because we're replacing the contents of the
  // entire buffer.
  auto sum_accessor = sum_buf.template get_access<dp_write>(h);
  // Use parallel_for to run the array addition in parallel, i.e. to execute
  // the kernel. The first parameter is the number of work items and the
  // second is the kernel: a lambda that specifies what to do per work item.
  // The template parameter ArrayAdd names the kernel at runtime, and the
  // parameter passed to the lambda is the work item id of the current item.
  //
  // To remove the requirement to specify the kernel name, you can enable
  // unnamed lambda kernels with the option:
  //   dpcpp -fsycl-unnamed-lambda
  h.parallel_for<class ArrayAdd>(num_items, [=](id<1> i) {
    sum_accessor[i] = addend_1_accessor[i] + addend_2_accessor[i];
  });
});
RTM_STOP();
printf("Queue submission + Accessors + parallel execution time = %2.6lf seconds\n", secs);

RTM_START();
// Call wait_and_throw to catch async exceptions
q->wait_and_throw();
RTM_STOP();
printf("Q wait_and_throw() execution time = %2.6lf seconds\n", secs);

There is still a big difference between the scalar block (0.063832 seconds) and the "Queue submission + Accessors + parallel execution" block (0.504512 seconds).
The parallel path seems roughly an order of magnitude slower than the scalar one. Did I miss something here?
In addition, I found these resources about SYCL profiling events:
https://codeplay.com/portal/08-27-19-optimizing-your-sycl-code-using-profiling
https://gist.github.com/GeorgeWeb/ff908516bfe57f107bc36822dbdfe145
void profile(event_list& eventList, const time_point_list& startTimeList) {
  if (startTimeList.size() != eventList.size()) {
    std::string errMsg =
        "The number of events do not match the number of starting time "
        "points.";
    throw std::runtime_error("Profiling Error:\n" + errMsg);
  }
  T cgSubmissionTime = 0;
  T kernExecutionTime = 0;
  T realExecutionTime = 0;
  const auto eventCount = eventList.size();
  for (size_t i = 0; i < eventCount; ++i) {
    auto curEvent = eventList.at(i);
    curEvent.wait();
    auto curStartTime = startTimeList.at(i);
    const auto end = wall_clock_t::now();
    time_interval_t<T, std::milli> curRealExecutionTime = end - curStartTime;
    realExecutionTime += curRealExecutionTime.count();
    const auto cgSubmissionTimePoint = curEvent.template get_profiling_info<
        cl::sycl::info::event_profiling::command_submit>();
    const auto startKernExecutionTimePoint =
        curEvent.template get_profiling_info<
            cl::sycl::info::event_profiling::command_start>();
    const auto endKernExecutionTimePoint =
        curEvent.template get_profiling_info<
            cl::sycl::info::event_profiling::command_end>();
    cgSubmissionTime +=
        to_milli(startKernExecutionTimePoint - cgSubmissionTimePoint);
    kernExecutionTime +=
        to_milli(endKernExecutionTimePoint - startKernExecutionTimePoint);
  }
  set_command_group_submission_time(cgSubmissionTime);
  set_kernel_execution_time(kernExecutionTime);
  set_real_execution_time(realExecutionTime);
}

I then applied the same steps to my vector addition code:
void VectorAddInDPCPP_2(const IntArray &VA, const IntArray &VB, IntArray &VC, queue deviceQueue) {
  // Print out the device information used for the kernel code
  std::cout << "Device: " << deviceQueue.get_device().get_info<info::device::name>() << std::endl;

  // Set up the profiling data containers
  using wall_clock_t = std::chrono::high_resolution_clock;
  using time_point_t = std::chrono::time_point<wall_clock_t>;
  const size_t profiling_iters = 1;
  std::vector<cl::sycl::event> eventList(profiling_iters);
  std::vector<time_point_t> startTimeList(profiling_iters);

  // Create the range object for the arrays managed by the buffers
  range<1> num_items{ array_size };
  buffer<int, 1> bufferA(VA.data(), num_items);
  buffer<int, 1> bufferB(VB.data(), num_items);
  buffer<int, 1> bufferC(VC.data(), num_items);

  // Submit a kernel to the queue; submit returns a SYCL event
  for (size_t i = 0; i < profiling_iters; ++i) {
    startTimeList.at(i) = wall_clock_t::now();
    eventList.at(i) = deviceQueue.submit([&](handler &cgh) {
      auto accessorA = bufferA.get_access<dp_read>(cgh);
      auto accessorB = bufferB.get_access<dp_read>(cgh);
      auto accessorC = bufferC.get_access<dp_write>(cgh);
      cgh.parallel_for(num_items, [=](id<1> j) {
        accessorC[j] = accessorA[j] + accessorB[j];
      });
    });
  }

  // Collect and print the profiling results
  example_profiler<double> my_profiler(eventList, startTimeList);
  std::cout << "Kernel exec: " << my_profiler.get_kernel_execution_time() << " msec" << std::endl;
  std::cout << "Cmd Group submission: " << my_profiler.get_command_group_submission_time() << " msec" << std::endl;
  std::cout << "Real exec: " << my_profiler.get_real_execution_time() << " msec" << std::endl;
}

Finally, this is the result for a 10K array on the DevCloud:
########################################################################
# Date: Wed Apr 1 08:26:52 PDT 2020
# Job ID: 561887.v-qsvr-1.aidevcloud
# User: u38134
# Resources: neednodes=1:gpu:ppn=2,nodes=1:gpu:ppn=2,walltime=06:00:00
########################################################################
:: setvars has already been run. Skipping any further invocation. To force its re-execution, pass --force
./vector-add
Device: Intel(R) Gen9 HD Graphics NEO
Kernel exec: 0.150332 msec
Cmd Group submission: 2.66203 msec
Real exec: 167.013 msec
VectorAddInDPCPP exec: 232.501 msec
Scalar exec: 0.095 msec
success
########################################################################
# End of output for job 561887.v-qsvr-1.aidevcloud
# Date: Wed Apr 1 08:26:57 PDT 2020
########################################################################

As you can see above, the queue submission + kernel execution time is much less than the real execution time. What is the reason for waiting that long, and is there a way to reduce or eliminate this waiting time?
Melih