DevCloud: OneAPI vector-add example parallel execution is much slower than the scalar one

New Contributor

6 years ago

Hi again. It seems that ahead-of-time (AOT) compilation drastically reduced the run time on the device. On the other hand, it increased the host execution time.

This is the result for AOT compiling:

########################################################################
#      Date:           Tue Apr 14 05:18:06 PDT 2020
#    Job ID:           573459.v-qsvr-1.aidevcloud
#      User:           u38134
# Resources:           neednodes=1:gpu:ppn=2,nodes=1:gpu:ppn=2,walltime=06:00:00
########################################################################
 
:: setvars has already been run. Skipping any further invocation.  To force its re-execution, pass --force
./vector-add
Vector Size:                            100000
-------------------------------------------
Device: Intel(R) Gen9 HD Graphics NEO
Kernel exec:              0.088999 msec
Cmd Group submission:     3.9869 msec
Real exec:                9.85082 msec
VectorAddInDPCPP exec:    17.543 msec
Scalar exec:              1.785 msec
success
 
########################################################################
# End of output for job 573459.v-qsvr-1.aidevcloud
# Date: Tue Apr 14 05:18:12 PDT 2020
########################################################################

And this is the result for JIT compiling:

########################################################################
#      Date:           Tue Apr 14 05:36:10 PDT 2020
#    Job ID:           573469.v-qsvr-1.aidevcloud
#      User:           u38134
# Resources:           neednodes=1:gpu:ppn=2,nodes=1:gpu:ppn=2,walltime=06:00:00
########################################################################
 
:: setvars has already been run. Skipping any further invocation.  To force its re-execution, pass --force
./vector-add
Vector Size:                            100000
-------------------------------------------
Device: Intel(R) Gen9 HD Graphics NEO
Kernel exec:              0.149666 msec
Cmd Group submission:     3.13661 msec
Real exec:                175.897 msec
VectorAddInDPCPP exec:    243.523 msec
Scalar exec:              0.099 msec
success
 
########################################################################
# End of output for job 573469.v-qsvr-1.aidevcloud
# Date: Tue Apr 14 05:36:15 PDT 2020
########################################################################
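To put the two logs side by side, here is a quick sketch that computes the JIT/AOT ratio for each timing. The numbers are copied from the output above; the dictionary names and the `ratio` helper are only for illustration and are not part of the vector-add sample.

```python
# Timings in msec, copied from the two job logs above.
aot = {
    "Kernel exec": 0.088999,
    "Cmd Group submission": 3.9869,
    "Real exec": 9.85082,
    "VectorAddInDPCPP exec": 17.543,
    "Scalar exec": 1.785,
}
jit = {
    "Kernel exec": 0.149666,
    "Cmd Group submission": 3.13661,
    "Real exec": 175.897,
    "VectorAddInDPCPP exec": 243.523,
    "Scalar exec": 0.099,
}


def ratio(label):
    """JIT time divided by AOT time; a value above 1 means AOT was faster."""
    return jit[label] / aot[label]


for label in aot:
    print(f"{label:<24} JIT/AOT = {ratio(label):.2f}")
```

By this measure, Real exec and VectorAddInDPCPP exec are roughly 14 to 18 times faster with AOT, while Scalar exec is roughly 18 times slower. These are single runs, so the ratios are only indicative.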

And finally, this is my Makefile for AOT:

CXX = dpcpp
#CXXFLAGS = -O2 -g
#LDFLAGS = -lOpenCL -lsycl
EXE_NAME = vector-add
SOURCES = src/vector-add.cpp
 
all: main
 
main:
	$(CXX) -fsycl-targets=spir64_gen-unknown-unknown-sycldevice -Xsycl-target-backend '-device skl' -o $(EXE_NAME) $(SOURCES)
 
run: 
	./$(EXE_NAME)
 
clean: 
	rm -rf $(EXE_NAME)

Did I skip something here? We improved the device kernel exec time, but why is the host performance (Scalar exec above) suffering now?

GNL
