Hello amaltha,
When you compile for emulation, your program runs entirely on the CPU.
It is compiled just as it would be with any SYCL compiler that doesn't target FPGAs.
So FPGA-specific information such as the device name or the target frequency doesn't really matter.
The performance that you get from this program is not at all representative of the performance you will get when compiling for FPGAs.
When writing code that will run on an FPGA, the optimizations that you make are different from the ones you make when targeting a CPU.
Therefore, a version of your program that you optimized for an FPGA may very well perform worse under emulation than your original program.
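
To make this concrete, here is a small sketch in plain C++ (no SYCL, so it builds with any compiler). It is not taken from your code: it uses a floating-point sum as an example, and the names sum_simple, sum_shift_register and kDepth are made up for illustration. The second function uses a shift register of partial sums, a pattern commonly used on FPGAs to relax the loop-carried dependency on the accumulator. On a CPU, which is where emulation runs, the same pattern only adds extra copies per iteration, so it will typically be slower than the simple loop.

```cpp
#include <array>
#include <cstddef>
#include <vector>

// Straightforward reduction: the form a CPU compiler handles well.
double sum_simple(const std::vector<double>& data) {
    double acc = 0.0;
    for (double x : data) {
        acc += x;
    }
    return acc;
}

// Hypothetical shift-register depth, chosen only for illustration.
// On real hardware it would be tuned to the latency of the adder.
constexpr std::size_t kDepth = 8;

// FPGA-style reduction: each new partial sum depends on a value written
// kDepth iterations earlier instead of on the previous iteration.
double sum_shift_register(const std::vector<double>& data) {
    std::array<double, kDepth + 1> shift{};  // all elements start at 0.0

    for (double x : data) {
        // Add the new element to the oldest partial sum.
        shift[kDepth] = shift[0] + x;
        // Shift every partial sum down by one slot.
        for (std::size_t j = 0; j < kDepth; ++j) {
            shift[j] = shift[j + 1];
        }
    }

    // Combine the kDepth partial sums into the final result.
    double acc = 0.0;
    for (std::size_t j = 0; j < kDepth; ++j) {
        acc += shift[j];
    }
    return acc;
}
```

Both functions compute the same sum (up to floating-point rounding, since the order of the additions differs), but only a hardware compile can tell you whether the second one actually helps on the FPGA. Timing the two under emulation only tells you how they behave on your CPU.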
In this case, it is hard to tell what you are comparing with what: you mention that you compared the parallel_for loop to a "CPU" execution, but as far as I understand, all the programs you launched ran on a CPU.