If you are sure your bottleneck is going to be the PCI-E transfer, there is no point in accelerating your application on a PCI-E-attached accelerator, be it an FPGA, a GPU or anything else. Running it on the CPU could be the best solution, since the PCI-E transfer is avoided altogether. Furthermore, all OpenCL-capable Stratix V and Arria 10 boards that I know of are limited to PCI-E x8, while nearly all GPUs from the past few years offer at least PCI-E x16, so if you do go with an accelerator, a GPU will be the better option for a transfer-bound application.
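A rough way to judge whether the transfer really dominates is to compare the theoretical PCI-E transfer time against the measured kernel time. The sketch below is only a back-of-the-envelope estimate: the function names are made up for illustration, and the bandwidth figures are theoretical one-direction peaks derived from the per-lane line rate and encoding, so real throughput will be lower.

```python
# Per-lane line rate in GT/s and encoding efficiency for each PCI-E generation.
PCIE_GEN = {
    1: (2.5, 8 / 10),     # 8b/10b encoding
    2: (5.0, 8 / 10),     # 8b/10b encoding
    3: (8.0, 128 / 130),  # 128b/130b encoding
}


def peak_bandwidth_gbs(gen, lanes):
    """Theoretical one-direction peak in GB/s (protocol overhead ignored)."""
    rate_gt, efficiency = PCIE_GEN[gen]
    return rate_gt * efficiency * lanes / 8  # divide by 8: bits -> bytes


def transfer_seconds(num_bytes, gen, lanes):
    """Best-case time to move num_bytes over the link."""
    return num_bytes / (peak_bandwidth_gbs(gen, lanes) * 1e9)


def is_transfer_bound(num_bytes, kernel_seconds, gen, lanes):
    """True if even the best-case transfer takes longer than the kernel."""
    return transfer_seconds(num_bytes, gen, lanes) > kernel_seconds


if __name__ == "__main__":
    # Compare typical FPGA board links (x8) with a typical GPU link (x16).
    for gen, lanes in [(2, 8), (3, 8), (3, 16)]:
        print(f"Gen {gen} x{lanes}: {peak_bandwidth_gbs(gen, lanes):.2f} GB/s peak")
```

If `is_transfer_bound` returns True for your buffer sizes and measured kernel time, a faster kernel will not help, and only a wider or faster link (or staying on the CPU) will.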
The reason you cannot run your kernels simultaneously likely has very little to do with the board you are using. Either something in your host code is preventing your kernels from running in parallel, or there is some board-independent limitation in Altera/Intel's OpenCL runtime. As I mentioned in my previous reply, I have personally run kernels in parallel successfully on the same board. You can find the design here (v8 kernel):
https://github.com/fpga-opencl-benchmarks/rodinia_fpga/tree/35b061f6b9c976dc44f86d6c2bd007c756c64349/opencl/lud/ocl
Are you sure the OpenCL implementation of your board only supports PCI-E Gen 2.0? My Stratix V board is installed on a machine that only supports Gen 2.0 and hence, it has to run at Gen 2.0, but my Arria 10 board runs at Gen 3.0 on a newer motherboard without any issue. Maybe your motherboard doesn't support Gen 3.0?
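On a Linux host, one way to tell the two cases apart is to compare the negotiated link with the maximum the device reports, using the PCI attributes the kernel exposes in sysfs. This is a minimal sketch, assuming a reasonably recent kernel that provides these files; the device address `0000:01:00.0` is a placeholder that you would replace with your board's address as shown by `lspci`, and the helper names are made up for illustration.

```python
from pathlib import Path

ATTRS = ("current_link_speed", "max_link_speed",
         "current_link_width", "max_link_width")


def read_link_status(bdf, sysfs_root="/sys/bus/pci/devices"):
    """Read the PCI-E link attributes of one device; missing files give None."""
    dev = Path(sysfs_root) / bdf
    status = {}
    for name in ATTRS:
        attr = dev / name
        status[name] = attr.read_text().strip() if attr.is_file() else None
    return status


def leading_number(text):
    """Parse the leading number of e.g. '5.0 GT/s PCIe' or '8'; None on failure."""
    try:
        return float(text.split()[0])
    except (AttributeError, IndexError, ValueError):
        return None


def is_link_downgraded(status):
    """True if the link runs below the device maximum, None if unknown."""
    values = {k: leading_number(v) for k, v in status.items()}
    if any(v is None for v in values.values()):
        return None
    return (values["current_link_speed"] < values["max_link_speed"]
            or values["current_link_width"] < values["max_link_width"])


if __name__ == "__main__":
    status = read_link_status("0000:01:00.0")  # placeholder address
    print(status, is_link_downgraded(status))
```

If the current speed is below the maximum, the board itself supports a faster link than was negotiated, which points at the slot or motherboard. If both report 5 GT/s, the PCI-E interface in the board's OpenCL implementation is itself Gen 2.0.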
Terasic's documentation for the DE5-Net board is here:
https://www.terasic.com.tw/cgi-bin/page/archive.pl?Language=English&CategoryNo=158&No=526&PartNo=4
The documents for the Intel FPGA SDK for OpenCL are here:
https://www.intel.com/content/www/us/en/programmable/products/design-software/embedded-software-developers/opencl/support.html
Intel HLS documents are here:
https://www.intel.com/content/www/us/en/programmable/products/design-software/high-level-design/intel-hls-compiler/support.html
These links include all the official documents available.