First of all, I'm guilty of not reading this entire thread, but are you essentially trying to perform a vector operation (the same operation across multiple data in memory)? If so, and assuming the algorithm isn't particularly complicated, a DMA engine with your transform hardware in the middle should suffice. For example, if you wanted to perform a vector addition of two blocks in memory, you would just need this:
read master --> adder transform block --> write master
If you don't want to use a DMA but need memory masters to do this, you could take a look at the read and write masters in the Qsys tutorial and just shove a custom controller on them. Vector addition is an overly simple algorithm, though, and chances are it doesn't offer much in terms of speedup over a processor, since you are performing two reads and one write for every computation in the vector (i.e. it is memory limited). You realistically get a speedup when the algorithm is data independent or feed forward, so that you can perform a bunch of calculations in a wide or pipelined data path.
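To make the memory-limited point concrete, here is a rough software model in C of the read master --> adder --> write master chain. It is only a sketch for counting memory accesses, not HDL and not the Qsys tutorial code. All of the names (mem_stats_t, read_master, adder_transform, write_master, vector_add_model) are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical counters standing in for traffic on the memory bus. */
typedef struct {
    size_t reads;
    size_t writes;
} mem_stats_t;

/* Models the read master: fetch one word and count the access. */
static uint32_t read_master(const uint32_t *src, size_t i, mem_stats_t *st)
{
    st->reads++;
    return src[i];
}

/* Models the transform block in the middle: a plain adder. */
static uint32_t adder_transform(uint32_t a, uint32_t b)
{
    return a + b;
}

/* Models the write master: store one word and count the access. */
static void write_master(uint32_t *dst, size_t i, uint32_t v, mem_stats_t *st)
{
    st->writes++;
    dst[i] = v;
}

/* Streams n elements through the chain and returns the access counts. */
mem_stats_t vector_add_model(const uint32_t *a, const uint32_t *b,
                             uint32_t *out, size_t n)
{
    mem_stats_t st = {0, 0};
    assert(a != NULL && b != NULL && out != NULL);
    for (size_t i = 0; i < n; i++) {
        uint32_t x = read_master(a, i, &st);
        uint32_t y = read_master(b, i, &st);
        write_master(out, i, adder_transform(x, y), &st);
    }
    return st;
}
```

Calling vector_add_model on two n-element arrays counts 2n reads and n writes for only n additions, so the adder spends most of its time waiting on memory. That is why a transform this simple is unlikely to beat a processor by much.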
A lot of the C to HDL products out there have gone the way of the dinosaur, since it's pretty difficult for compilers to find parallelism in sequential C code. As far as I know, the only commercial player left in that space is ImpulseC.