GNL
New Contributor
5 years ago
DevCloud: oneAPI vector-add example parallel execution is much slower than the scalar one
Hello,
I tried the BaseKit-code-samples/DPC++Compiler/vector-add example as suggested in the DeveloperZone get started page:
https://devcloud.intel.com/oneapi/get-started/base-toolkit/
I measured the execution time of add_arrays_scalar() and add_arrays_parallel() functions with my RTM_START() and RTM_STOP() macros.
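#include <stdio.h>      // fopen, fgets, fprintf
#include <stdlib.h>     // atof
#include <string.h>     // strncmp, strchr
#include <x86intrin.h>  // _rdtsc

// Returns the "cpu MHz" value of the first core listed in /proc/cpuinfo,
// converted to Hz, or -1 if it cannot be read.
// Note: this is the core's current clock, which is not necessarily the
// rate at which the TSC ticks.
double getCPUFreq() {
#define BUFLEN 110
    FILE* sysinfo;
    char* ptr;
    char buf[BUFLEN];
    char key[] = "cpu MHz";
    int keylen = sizeof(key) - 1;
    double freq = -1;
    sysinfo = fopen("/proc/cpuinfo", "r");
    if (sysinfo != NULL) {
        while (fgets(buf, BUFLEN, sysinfo) != NULL) {
            if (!strncmp(buf, key, keylen)) {
                ptr = strchr(buf, ':');
                // Guard against a malformed line without a ':' separator
                if (ptr != NULL) {
                    freq = atof(ptr + 1) * 1000000;
                }
                break;
            }
        }
        fclose(sysinfo);
    }
    fprintf(stderr, "Freq = %f GHz\n", freq / 1000000000);
    return freq;
}

// Both macros expect doubles named start, stop and secs in the calling scope.
#define RTM_START() start = (double)_rdtsc()
#define RTM_STOP() do { \
        stop = (double)_rdtsc(); \
        secs = ((double)(stop - start)) / (double)getCPUFreq(); \
    } while (0)

And my main function, where I took the measurements: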
int main() {
    double start = 0.0, stop = 0.0, secs = 0.0;
    IntArray addend_1, addend_2, sum_scalar, sum_parallel;
    // Initialize arrays with values from 0 to array_size-1
    initialize_array(addend_1);
    initialize_array(addend_2);
    initialize_array(sum_scalar);
    initialize_array(sum_parallel);
    printf("CPU Freq = %10.2lf\n", (double)getCPUFreq());
    start = (double)_rdtsc();
    // Add arrays in scalar and in parallel
    add_arrays_scalar(sum_scalar, addend_1, addend_2);
    stop = (double)_rdtsc();
    secs = ((double)(stop - start)) / (double)getCPUFreq();
    printf("Scalar execution time = %2.6lf seconds\n", secs);
    add_arrays_parallel(sum_parallel, addend_1, addend_2);
    // Verify that the two sum arrays are equal
    for (size_t i = 0; i < sum_parallel.size(); i++) {
        if (sum_parallel[i] != sum_scalar[i]) {
            std::cout << "fail" << std::endl;
            return -1;
        }
    }
    std::cout << "success" << std::endl;
    std::cout << "MGUNAL TEST 5 DONE" << std::endl;
    return 0;
}

And finally my results:
########################################################################
# Date: Wed Mar 11 06:14:01 PDT 2020
# Job ID: 542630.v-qsvr-1.aidevcloud
# User: u38134
# Resources: neednodes=1:gpu:ppn=2,nodes=1:gpu:ppn=2,walltime=06:00:00
########################################################################
:: setvars has already been run. Skipping any further invocation. To force its re-execution, pass --force
./vector-add
CPU Freq = 954182000.00
Scalar execution time = 0.000085 seconds
Device: Intel(R) Gen9 HD Graphics NEO
Parallel execution time = 3.516770 seconds
success
MGUNAL TEST 5 DONE
########################################################################
# End of output for job 542630.v-qsvr-1.aidevcloud
# Date: Wed Mar 11 06:14:03 PDT 2020
########################################################################

As seen above, the parallel execution time is much greater than the scalar execution time. So I decided to use a SYCL host device instead of an accelerator.
/*
// FPGA device selector: Emulator or Hardware
#ifdef FPGA_EMULATOR
intel::fpga_emulator_selector device_selector;
#elif defined(FPGA)
intel::fpga_selector device_selector;
#else
// Initializing the devices queue with the default selector
// The device queue is used to enqueue the kernels and encapsulates
// all the states needed for execution
default_selector device_selector;
#endif
*/
host_selector device_selector;

Parallel execution is still significantly slower than the scalar one:
########################################################################
# Date: Thu Mar 12 00:34:12 PDT 2020
# Job ID: 542971.v-qsvr-1.aidevcloud
# User: u38134
# Resources: neednodes=1:gpu:ppn=2,nodes=1:gpu:ppn=2,walltime=06:00:00
########################################################################
:: setvars has already been run. Skipping any further invocation. To force its re-execution, pass --force
./vector-add
CPU Freq = 1000065000.00
Scalar execution time = 0.000077 seconds
Device: SYCL host device
Parallel execution time = 0.318517 seconds
success
########################################################################
# End of output for job 542971.v-qsvr-1.aidevcloud
# Date: Thu Mar 12 00:34:14 PDT 2020
########################################################################

I have no idea why the parallel execution takes so long. Am I doing something wrong?
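
As a cross-check on the _rdtsc()-based numbers above, the same measurement could also be taken with std::chrono::steady_clock, which does not depend on the "cpu MHz" value read from /proc/cpuinfo. The following is only a sketch: time_call_seconds and time_twice are hypothetical helper names and are not part of the vector-add sample. Timing the same call twice is simply a way to see whether the first call behaves differently from the second one.

```cpp
#include <chrono>
#include <utility>

// Hypothetical helper (not part of the vector-add sample): runs `work` once
// and returns the elapsed wall-clock time in seconds.
template <typename F>
double time_call_seconds(F&& work) {
    const auto begin = std::chrono::steady_clock::now();
    std::forward<F>(work)();
    const auto end = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(end - begin).count();
}

// Hypothetical helper: holds the timings of two consecutive calls.
struct TwoRunTiming {
    double first_secs;
    double second_secs;
};

// Runs `work` twice and times each run separately, so that a difference
// between the first and the second call becomes visible.
template <typename F>
TwoRunTiming time_twice(F&& work) {
    TwoRunTiming t;
    t.first_secs = time_call_seconds(work);
    t.second_secs = time_call_seconds(work);
    return t;
}
```

In the main() above, this could wrap the existing calls, for example time_twice([&] { add_arrays_parallel(sum_parallel, addend_1, addend_2); }), and the two values could then be printed next to the scalar time.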