Hey,
I am currently working on running a conjugate gradient (CG) solver on an FPGA. I used the reference you posted as a template and succeeded in implementing the CG. Unfortunately, I am facing some challenges with the performance. After further investigation, I found that the vector addition runs much slower on the FPGA than on the GPU or CPU. While the matrix multiplication and the dot product were relatively okay, the overall performance is quite disappointing.
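For context, by "vector addition" I mean the axpy-style updates that happen in every CG iteration. Below is a minimal plain C++ sketch of that structure. It is not my actual FPGA kernel code: the function names (`axpy`, `dot`, `matvec`, `cg`) and the dense row-major matrix layout are assumptions made only for illustration.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

using Vec = std::vector<double>;

// y <- y + alpha * x  (the "vector addition" step)
void axpy(double alpha, const Vec& x, Vec& y) {
    for (std::size_t i = 0; i < y.size(); ++i) y[i] += alpha * x[i];
}

// Dot product of two vectors of equal length
double dot(const Vec& a, const Vec& b) {
    double s = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) s += a[i] * b[i];
    return s;
}

// Dense matrix-vector product, A stored row-major as n*n values (assumed layout)
Vec matvec(const Vec& A, const Vec& x) {
    const std::size_t n = x.size();
    Vec y(n, 0.0);
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = 0; j < n; ++j)
            y[i] += A[i * n + j] * x[j];
    return y;
}

// Unpreconditioned CG for a symmetric positive definite A, starting from x = 0
Vec cg(const Vec& A, const Vec& b, int max_iter, double tol) {
    Vec x(b.size(), 0.0);
    Vec r = b;  // residual b - A*x with x = 0
    Vec p = r;  // initial search direction
    double rr = dot(r, r);
    for (int k = 0; k < max_iter && std::sqrt(rr) > tol; ++k) {
        Vec Ap = matvec(A, p);
        const double alpha = rr / dot(p, Ap);
        axpy(alpha, p, x);    // x <- x + alpha * p
        axpy(-alpha, Ap, r);  // r <- r - alpha * A*p
        const double rr_new = dot(r, r);
        const double beta = rr_new / rr;
        for (std::size_t i = 0; i < p.size(); ++i)
            p[i] = r[i] + beta * p[i];  // p <- r + beta * p
        rr = rr_new;
    }
    return x;
}
```

Each iteration has one matrix-vector product, two dot products, and three vector updates, so the vector updates are the part I am referring to.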
As I am not very experienced in optimizing code, I considered using oneMKL for the CG implementation to see whether it would improve the performance. However, when using oneMKL, I get the 'device not supported' error that I already posted about.