The Block RAM usage is likely due to the cache the compiler automatically creates for global memory accesses; this cache was explicitly mentioned in the old area report, but all mention of it was removed from the new HTML report. Try marking your "weights" buffer as volatile, or adding "--opt-arg -nocaching" (or "-opt-arg=-nocaching" for newer versions of the compiler) to the command line to disable the cache, and see if it makes a difference.

That said, at the end of the day, having this many kernels that each expose an interface to the host and to external memory will waste a significant amount of logic and Block RAMs just to implement those interfaces, and that is on top of the fact that you will have to create a queue for every one of these kernels on the host and pay the associated coding and kernel-launch overhead. I would instead recommend implementing your PEs as autorun kernels, using the automatic kernel replication provided by the compiler, and having only one non-autorun kernel that reads data from memory and feeds the PEs, plus another non-autorun kernel that reads from the PEs and writes the results back to memory. If your PEs are regular and there is little to no rate imbalance between them, you can get away with a very shallow channel depth (below 20) that will not even use Block RAMs.
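As a rough sketch of what that structure could look like (assuming the Intel FPGA SDK for OpenCL; the kernel names, PE count, channel depths, trivial PE body, and the assumption that n is a multiple of NUM_PES are all placeholders, not your actual design):

```c
#pragma OPENCL EXTENSION cl_intel_channels : enable

#define NUM_PES 4   // placeholder replication factor

// Shallow channels; small depths are typically implemented in registers, not Block RAM
channel float feed_ch[NUM_PES]  __attribute__((depth(16)));
channel float drain_ch[NUM_PES] __attribute__((depth(16)));

// Replicated autorun PEs: no host queue, no memory interface
__attribute__((max_global_work_dim(0)))
__attribute__((autorun))
__attribute__((num_compute_units(NUM_PES)))
__kernel void pe(void)
{
    const int id = get_compute_id(0); // compile-time constant per replica
    while (1) {
        float x = read_channel_intel(feed_ch[id]);
        // ... actual PE computation goes here (placeholder) ...
        write_channel_intel(drain_ch[id], x);
    }
}

// Single non-autorun kernel reading from memory and feeding the PEs
__attribute__((max_global_work_dim(0)))
__kernel void feeder(__global const float * restrict in, int n)
{
    for (int i = 0; i < n; i += NUM_PES) {
        #pragma unroll   // fully unrolled so channel indices are compile-time constants
        for (int p = 0; p < NUM_PES; p++)
            write_channel_intel(feed_ch[p], in[i + p]);
    }
}

// Single non-autorun kernel draining the PEs and writing back to memory
__attribute__((max_global_work_dim(0)))
__kernel void drainer(__global float * restrict out, int n)
{
    for (int i = 0; i < n; i += NUM_PES) {
        #pragma unroll
        for (int p = 0; p < NUM_PES; p++)
            out[i + p] = read_channel_intel(drain_ch[p]);
    }
}
```

Only feeder and drainer need host queues and memory interfaces here; the replicated PEs start automatically at power-up and communicate purely over channels.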
P.S. There is no guarantee the implementation mentioned in the paper uses a separate kernel for every PE; some people also consider unrolled loop iterations to be PEs... In other words, the Cvec, Wvec and Kvec values mentioned in the paper might simply be the unroll factors of the different loops in their code. Moreover, while the paper says everything is implemented in OpenCL, we cannot really be sure...