Hey amaltaha,
1 - Yes, FPGAs are used for acceleration. If you are not seeing a speedup, either your application is not well suited to FPGA acceleration (e.g. the FPGA computation is very short compared to the overhead of offloading it to the FPGA), or your code needs to be reworked to better fit the FPGA programming model.
2 - Yes, the latency is expressed in clock cycles. This is explained in section 2.1.1.1 of the "FPGA Optimization Guide for Intel® oneAPI Toolkits : Developer Guide".
Setting the command-line parameter "-Xsclock=500MHz" does not guarantee that the hardware will run at that speed. It is a target clock frequency, not the achieved one. To see the achieved clock frequency, look in the generated report. Section 2.0 of the guide quoted above covers the analysis of the generated report, and section 4.1.1 explains the "-Xsclock" parameter.
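To make the difference between the target and the achieved clock concrete, here is a rough sketch of how a latency in cycles translates into time. The 1000-cycle latency and the 250 MHz achieved clock are made-up numbers for illustration only; use the values from your own report.

```cpp
#include <cassert>

// Convert a latency reported in clock cycles into seconds for a given clock.
// All numbers used below are hypothetical and do not come from any real report.
double latency_seconds(double cycles, double clock_hz) {
    assert(clock_hz > 0.0);  // a clock frequency must be positive
    return cycles / clock_hz;
}

// The same 1000-cycle latency at the requested 500 MHz target...
double at_target() { return latency_seconds(1000.0, 500e6); }

// ...and at an assumed achieved clock of 250 MHz, which takes twice as long.
double at_achieved() { return latency_seconds(1000.0, 250e6); }
```

The cycle count stays the same in both cases; only the achieved clock tells you the actual time.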
3 - As I mentioned earlier, I'd suggest you retry your experiment using the "#pragma unroll" compiler directive rather than parallel_for. This is demonstrated with hands-on code examples on the "Explore SYCL* Through Intel® FPGA Code Samples" webpage, and is also described in section 4.6.8 of the optimization guide.
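As a minimal sketch of what I mean (not your actual kernel), the loop below is annotated with "#pragma unroll". It is written as a plain host function so that it compiles anywhere; the function name, the array size and the dot-product body are my own assumptions. In a SYCL design the same loop would sit inside a single_task kernel, as hinted in the comment.

```cpp
#include <array>
#include <cstddef>

constexpr std::size_t kSize = 8;  // hypothetical small, fixed trip count

// Loop body as it could appear inside a SYCL single_task kernel, e.g.
//   h.single_task<class DotKernel>([=]() { /* loop goes here */ });
// It is a plain host function here so the sketch needs no SYCL toolchain.
float dot(const std::array<float, kSize>& a,
          const std::array<float, kSize>& b) {
    float sum = 0.0f;
    // Ask the compiler to fully unroll the loop that follows. A host compiler
    // that does not know this pragma ignores it, possibly with a warning.
    #pragma unroll
    for (std::size_t i = 0; i < kSize; ++i) {
        sum += a[i] * b[i];
    }
    return sum;
}
```

The pragma has to come immediately before the loop it applies to, and the trip count should be known at compile time for a full unroll. Check the report afterwards to confirm the loop was actually unrolled.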
Yohann