I asked around, and using a custom on-chip memory as a tightly coupled memory is only possible in Qsys.
I'm not sure whether that message is specific to s2 or whether it really should apply to both. Yesterday was the first time I'd seen that message, so I don't know what the story behind it is.
It's been so long since I parameterized a memory with the OLD_DATA parameter value that I'm not sure whether it enforces a read latency of 2. I suspect it doesn't: unless things have changed in the on-chip memory component, I recall that the additional register added when you select 2 cycles actually lives outside the on-chip RAM block, which gives the place and route engine more freedom to move it around.
If the mutex component overhead is too high due to the lack of a data cache, perhaps a single-cycle custom instruction, shared by both CPUs and implementing a faster lock, would do the trick. It could actually be 0 cycles (combinational), but that might hinder the CPU fmax.
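To make the shared-lock idea a bit more concrete, here is a rough C model of what such a custom instruction might do. Everything in it is an assumption on my part, not an existing component: the names lock_ci, lock_try_acquire and lock_release are hypothetical, as are the operand encoding (first operand = a nonzero CPU id, second operand = the operation) and the choice to return the current owner. On real hardware the lock_owner variable would be a single register inside the custom instruction logic shared by both CPUs, and lock_ci would be replaced by the custom instruction call itself; the model below only shows the intended semantics and runs on a host.

```c
#include <assert.h>
#include <stdint.h>

/* Model of the one register held by the (hypothetical) shared custom
 * instruction logic: 0 means free, any other value is the owner's id. */
static uint32_t lock_owner = 0;

/* Hypothetical operation encoding for the second operand. */
enum { LOCK_OP_ACQUIRE = 0, LOCK_OP_RELEASE = 1 };

/* Software stand-in for the custom instruction. In hardware the whole
 * body would be one atomic register update, so neither CPU could ever
 * observe a half-finished acquire. */
static uint32_t lock_ci(uint32_t id, uint32_t op)
{
    assert(id != 0); /* id 0 is reserved to mean "free" */

    if (op == LOCK_OP_ACQUIRE) {
        if (lock_owner == 0) {
            lock_owner = id;      /* lock was free: take it */
        }
    } else if (lock_owner == id) {
        lock_owner = 0;           /* only the owner may release */
    }
    return lock_owner;            /* owner after the operation */
}

/* Returns 1 if the caller now holds the lock, 0 otherwise. */
static int lock_try_acquire(uint32_t id)
{
    return lock_ci(id, LOCK_OP_ACQUIRE) == id;
}

/* Releases the lock if the caller owns it; otherwise does nothing. */
static void lock_release(uint32_t id)
{
    (void)lock_ci(id, LOCK_OP_RELEASE);
}
```

Each CPU would spin on lock_try_acquire with its own id and call lock_release when done. Because the lock state lives in the custom instruction logic rather than in memory, no uncached bus access is needed to test or take the lock, which is the overhead the mutex component incurs without a data cache.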