1 - I'm not sure what information other than frequency you are looking for. To specify a clock target for your compile, add -Xsclock=<clock target> to your compile command.
You can find all this documentation in the "FPGA Optimization Guide for Intel® oneAPI Toolkits : Developer Guide".
The clock setting option is described in 4.1.1.
2 - I can't tell from your description what is limiting your implementation. However, if you are comparing the iterative and parallel versions, both on FPGA, then you should in theory get better throughput with the parallel version. I don't know how long your computation lasts, but it should run for more than a few seconds to benefit from offloading to an FPGA.
I don't know what you mean by "The segmentation fault is caused by sorting".
I'm not sure I understand what you mean by the "host and kernel codes separated even with fpga run". Your kernel code is in the q.submit section, and the host code is everything around it. Your host code issues a call to the FPGA, then has to wait for the FPGA to return the results before continuing the host computation.
3 - Yes, parallel_for means that all the iterations are executed at the same time; however, in the general case they won't execute in one cycle.
I encourage you again to have a look at the "Explore SYCL* Through Intel® FPGA Code Samples" webpage, which has many examples to help you get familiar with these concepts and teaches good coding practices for FPGA development.
There is even a tutorial for loop unrolling on FPGA, which demonstrates the recommended method: use a for loop with a "#pragma unroll" compiler directive (so no parallel_for).
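To give a rough idea of that pattern, here is a minimal sketch in plain C++ (this is not code from the tutorial). The names kSize and DotProduct are hypothetical and only there for illustration. In your design, the loop would sit inside the kernel, i.e. the q.submit section. The loop bound is assumed to be a compile-time constant so that the compiler can unroll the loop completely.

```cpp
#include <array>
#include <cstddef>

// Hypothetical size chosen for illustration; it is a compile-time
// constant so the compiler can fully unroll the loop below.
constexpr std::size_t kSize = 8;

// Hypothetical helper: dot product of two fixed-size vectors.
// On an FPGA, this loop would be placed inside the kernel code.
float DotProduct(const std::array<float, kSize>& a,
                 const std::array<float, kSize>& b) {
  float sum = 0.0f;
  // Ask the compiler to unroll the loop. Compilers that do not
  // recognize this pragma ignore it (possibly with a warning), and
  // the computed result is the same either way.
#pragma unroll
  for (std::size_t i = 0; i < kSize; ++i) {
    sum += a[i] * b[i];
  }
  return sum;
}
```

Calling DotProduct on two 8-element arrays returns the same value with or without the pragma; what changes on FPGA is how much hardware is generated for the loop body. You can also give the pragma an unroll factor if a full unroll uses too much area.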
Cheers