Usually there has been an RTOS that doles out processor time to all the functions, with interrupts and context-switch overhead involved. If it is economical to have parallel dedicated processors, then that overhead goes away also.
I am not sure "The degree of parallelism scalable in the core." is exactly accurate. Each processor derives its performance from parallelism, and each is the same design, just in a different physical size. The next degree of parallelism then comes from multiple processors running in parallel. So now we use the inherent FPGA function very much as HDL does. Also, I think the overall operation is simplified.
I don't know enough about Simulink and DSP.
Seems like the whole business is still based on the original concepts that are being pushed to the limits as chip density increases.
By the way, I do realize that some communication between processors is required. It should be something like "Here go do this" and "OK, I'm done".
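To make that concrete, here is a minimal sketch of such a handshake, using Python threads and queues as stand-ins for the dedicated processors and their links. The names worker and run_jobs, the message layout, and the None shutdown signal are all made up for illustration; this is not how any particular FPGA design does it.

```python
import queue
import threading


def worker(commands, replies):
    """Stand-in for one dedicated processor: wait for a job, run it, report back."""
    while True:
        job = commands.get()              # block until "Here go do this" arrives
        if job is None:                   # hypothetical shutdown signal
            break
        name, func, arg = job
        result = func(arg)                # no RTOS involved; the worker just runs the job
        replies.put((name, "done", result))   # "OK, I'm done"


def run_jobs(jobs):
    """Give each job its own worker, then collect the completion messages."""
    replies = queue.Queue()
    workers = []
    for job in jobs:
        commands = queue.Queue()          # one private command link per worker
        t = threading.Thread(target=worker, args=(commands, replies))
        t.start()
        commands.put(job)                 # "Here go do this"
        commands.put(None)                # nothing further for this worker
        workers.append(t)
    done = [replies.get() for _ in jobs]  # wait for every "OK, I'm done"
    for t in workers:
        t.join()
    return sorted(done)                   # sorted by job name for a stable order


if __name__ == "__main__":
    print(run_jobs([("filter", lambda x: x * 2, 21), ("scale", lambda x: x + 1, 1)]))
```

The only traffic between the coordinator and a worker is one command going out and one completion message coming back, which is about all the communication I had in mind.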
As far as studies go, I think there has been a lot of "I need something else, and I will try anything available that sounds possible."